Integrated main memory and coprocessor with low latency

ABSTRACT

System, method, and apparatus for an integrated main memory (MM) and configurable coprocessor (CP) chip for processing a subset of network functions. Chip supports external accesses to MM without additional latency from the on-chip CP. On-chip memory scheduler resolves all bank conflicts and configurably load balances MM accesses. Instruction set and data on which the CP executes instructions are all disposed on-chip with no on-chip cache memory, thereby avoiding latency and coherency issues. Multiple independent and orthogonal threading domains are used: a FIFO-based scheduling domain (SD) for the I/O; a multi-threaded processing domain for the CP. The CP is an array of independent, autonomous, unsequenced processing engines that process on-chip data, tracked by the SD of the external CMD and reordered per the FIFO CMD sequence before transmission. Paired I/O ports tied to unique global on-chip SDs allow multiple external processors to slave the chip and its resources independently and autonomously, without scheduling between the external processors.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of, and claims priority to: i) PCT International Application No. PCT/IB2014/002903, entitled “INTEGRATED MAIN MEMORY AND COPROCESSOR WITH LOW LATENCY,” having an international filing date of Dec. 31, 2014; and ii) U.S. Application Ser. No. 61/922,693, filed Dec. 31, 2013, entitled “MEMORY CHIP WITH PROGRAMMABLE ENGINES,” which applications are also incorporated by reference herein in their entirety.

These and all other referenced patents and applications are incorporated herein by reference in their entirety. Furthermore, where a definition or use of a term in a reference, which is incorporated by reference herein, is inconsistent or contrary to the definition of that term provided herein, the definition of that term provided herein applies and the definition of that term in the reference does not apply.

FIELD OF TECHNOLOGY

This disclosure relates generally to the technical fields of integrated circuits, and in one example embodiment, this disclosure relates to a method, apparatus, and system of network processing and memory storage.

BACKGROUND

A network processing unit (NPU), a.k.a. a packet forwarding engine (PFE), is an integrated circuit (IC) designed and optimized for processing network packets, each of which contains header information composed of network address and protocol fields and a user data payload (the data unit at layer 3 of the open system interconnection (OSI) model). The PFE is tasked with performing functions on the header such as computation, pattern matching, manipulation of certain bits within the protocol fields, key lookup (for an internet protocol (IP) address) in a table, etc., for applications such as quality of service (QoS) enforcement, access control monitoring, packet forwarding, etc., in products such as routers, switches, firewalls, etc. found on a private network, e.g., a LAN, or on a public network, e.g., the Internet.

PFE packet processing rates currently exceed tens of millions of packets per second (Mpps). Thus, a substantial amount of data has to be processed by the PFE. To cope with this high bandwidth requirement, PFEs utilize multiple processing cores and multi-threading. The PFE stores data in, and fetches data from, off-chip memory such as dynamic random access memory (DRAM) chips. This off-chip memory is used to store data such as IP addresses for forward hops, traffic management statistics, QoS data, etc. The off-chip memory typically has a memory access controller (MAC) that performs simple operations such as reading data from memory and writing data to memory. Operations that are more sophisticated are typically performed by the PFE. Latency is incurred in any transfer of data to and from the PFE because of the processing time required to frame and transmit the data in a packet to and from the multiple chip interfaces. Pipelining helps to fill empty cycles, but latency still occurs.

Using a data cache and/or instruction cache on the PFE chip can help reduce latency in retrieving data or instructions from off-chip memory, by storing frequently used and prefetched data and instructions temporarily on-chip. A high-level cache, i.e., L1, is slaved to the on-die processor of the PFE. An on-die cache is not used as a main memory for storing a primary source of data from which resources other than the processor associated with the cache would then read. Latency is still incurred sending data back and forth between the on-die cache and off-chip memory. Because the data stored in the cache is a copy of the data stored in the off-chip memory, administrative overhead may be required to maintain coherency of data by synchronizing the copy of data stored in the cache versus the original data stored in one or more external memory devices, such as external buffer memory or external main memory. Sometimes an algorithm running on a PFE will repetitively fetch data stored in main memory for repetitive operations or frequent updates. If the cache has to be updated for each of these repetitive operations, then the fetch from external memory and the write back to external memory both incur latency.

Access throughput for many large data structures such as network address tables does not improve with data caches. The random nature of arriving packets from all points of the network, the fine-grain nature of the actual data structure, and the sparse, diffuse structure can make it difficult to hold enough of the data structure in the data cache over any one time span to make a statistical improvement in performance. This is known as poor temporal locality of the data structure. Therefore, it is often better to reduce the latency to memory by reducing the physical and electrical distance between the processor and the actual copy of the data structure. Often it is infeasible to put the whole data structure in on-chip memory of the PFE. However, moving the data off chip brings back the latency problem.

If a chip has an onboard microprocessor or microcontroller, then many memory accesses to an on-chip memory are typically processed by the microprocessor or microcontroller first. Otherwise, a direct access to the on-chip memory by an external host might alter data in the on-chip memory on which the microprocessor or microcontroller relies. Additionally, if the microprocessor or microcontroller is configured primarily as a special function microprocessor or microcontroller that does not normally access data in the on-chip memory, then an override function may be necessary to enable that microprocessor or microcontroller to make a special memory access to the on-chip memory. This may require an interrupt to the memory controller in order to drop current and newly arriving external accesses during the time required for the special memory access to complete its operation.

A PFE can include a complex on-die processor capable of sophisticated functions. The operations required for packet processing can range from simple to complex. If a separate coprocessor chip is utilized on a line card to offload less sophisticated operations from the PFE, then the coprocessor has the same latency while fetching and storing data to and from an off-chip memory. If the coprocessor has cache memory on die, then the same coherency overhead arises for synchronizing data between the on-die cache and off-chip memory. Moreover, if data from an external memory is shared between two or more other devices, e.g., a coprocessor cache and an NPU cache, then the complexity of the coherency can increase. Complex process signaling, mutual exclusion protocols, or multi-processor modified-exclusive-shared-invalid (MESI) protocols have been developed to facilitate data sharing. Even with these solutions, deadlock conditions can still occur.

A typical coprocessor is slaved to only one host in order to simplify accesses and commands from only one source. If more than one host were coupled to and communicating with a single coprocessor resource, then tracking and tracing of the source of a command would be required in order to return the data to the correct requestor. If the shared coprocessor resource has multi-threading capability for one or all of the multiple hosts coupled to it, then the overhead in managing the threads can be substantial.

Creating a memory coprocessor with fixed specialized abstract operations for a specific application can make the market too narrow, thus making the product less economically feasible.

The same design and application concerns mentioned herein also arise for processors other than network processors. For example, general-purpose graphics processor units (GPGPUs), multi-core workstation processors, video game consoles, and workstations for computational fluid dynamics, finite element modeling, weather modeling, etc. would involve similar concerns.

SUMMARY

An apparatus, method, and system for providing an integrated main memory (MM) and coprocessor (CP) chip (MMCC). The chip is a main memory because it has sufficient capacity that it does not cache data therein from off-chip resources, i.e., an off-chip memory. Thus, the chip avoids the coherency and poor temporal locality issues associated with caching data. The chip supports traditional external access like a discrete main memory chip. This is done without adding latency from the CP during the memory access. In particular, the present disclosure does not require an external access to MM to be processed first by the CP. Additionally, the chip performs local on-chip processing of local data stored in the MM and of data received with the command (CMD), without having to cross multiple discrete chip interfaces in order to fetch data from off-chip sources. This reduces power, latency, and host bandwidth consumption. In addition, it isolates the data from cross-process interference. Similarly, the chip supports subroutine call (CSUB) code executed by the CP on-chip to implement higher-level abstract data structure operations, and does not cache instructions from an off-chip memory, thereby avoiding additional latency for external memory fetches to perform cache fills. The coprocessor (CP) is programmable for performing subroutine calls (CSUBs) defined by a user on data that is stored in MM or received with a CMD. The chip is a highly efficient niche solution for frequently processed, short to moderate length CSUB code on high-value data requiring low latency. The coprocessor interface provides a well-ordered interface to control access and isolate the underlying nature of the data from higher-level PFE tasks.

In one embodiment, the apparatus includes a main memory (MM) that is accessible independently of the coprocessor, and that interleavedly processes both external access calls by a host and internal access calls by the CP. The internal access call does not require an interrupt of external access calls to access MM. Rather, the internal access calls can be load balanced to have a higher/lower/same priority as an external access call. The internal access call has substantially less latency, e.g., up to three orders of magnitude fewer cycles, than an external access call. This is because the internal access call bypasses the chip interface (I/F), such as SerDes, which incurs latency by having to receive, recover, decode, deserialize, and drive the data to the main memory. On a higher level, if a data fetch is being scheduled by a discrete processor/coprocessor (PFE) chip from a discrete and separate main memory chip, then the repeated trips, i.e., recursive operations, that the data has to make between multiple chip interfaces compound the latency, including the operations of the encoding, packetizing, serializing, and transmitting side of the interface. If the processor/coprocessor is also responsible for managing the conflict avoidance in scheduling a data fetch, then this consumes valuable processor bandwidth while tasks that are more sophisticated wait. Additionally, driving large quantities of data over external lines, which are longer than an internal chip path, consumes power and subjects the data to noise. The MM is partitioned in one embodiment to provide pipelined throughput at a memory clock (CLK) speed that is inversely proportional to a system CLK speed, according to a quantity of memory partitions.

The CP is comprised of one or more discrete, autonomous processing engines (PEs) having a fine-grained processor multi-threaded (PMT) (1 cycle/thread) configuration. The PE threading is orthogonal to and independent of the scheduling domain thread (SDT) of the I/O. A CSUB CMD to the CP is associated with an I/O scheduling domain (SD) to ensure FIFO processing vis-à-vis all other CMDs received by the chip. Thus, if three CSUB CMDs are received followed by two memory access CMDs for a given SDT, then the output from the MMCC will be the three CSUB CMD output results followed by the two memory access results, assuming each of these CMDs required data to be returned to the host. Each PE is independent of and autonomous from any other PE. Each PE executes local CSUB code for each of up to eight processing threads, in the current embodiment, and relies on the output interface of the chip to reorder the results per the SDT. Thus, a processing sequence of CSUB code, or of steps therein, can be any sequence that utilizes the PEs most optimally, e.g., highest duty cycle, fastest throughput, best hierarchical priority processing, etc.
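
As a minimal sketch of this reordering behavior, the following Python fragment (all names hypothetical, not the chip's actual interface) shows results completing out of order across PE threads while being released strictly in the FIFO arrival order of one SDT:

    from collections import deque

    class SdtReorderQueue:
        """Tracks the FIFO arrival order of CMDs for one SDT."""
        def __init__(self):
            self.pending = deque()   # CMD ids in arrival (FIFO) order
            self.done = {}           # CMD id -> result; PEs finish in any order

        def issue(self, cmd_id):
            self.pending.append(cmd_id)

        def complete(self, cmd_id, result):
            self.done[cmd_id] = result

        def drain(self):
            """Release results only in arrival order; stall on a missing head."""
            out = []
            while self.pending and self.pending[0] in self.done:
                out.append(self.done.pop(self.pending.popleft()))
            return out

For example, five CMDs issued in order 1 through 5 are transmitted as 1 through 5 even if the PEs complete them as 3, 1, 5, 2, 4.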

The MM and the CP perform their respective functions of memory accesses and subroutine calls independently of and concurrently with each other by: i) using partitioned and/or redundant resources; ii) using queues to load balance and/or prioritize the execution of tasks on these resources, such as the tasks of accessing data, executing CSUB code, storing data, and transmitting results; and iii) using a coarsely grained SDT reference to track CMDs and data through the chip. For example, the MM performs traditional native functions such as read, write, read-modify-write (RMW), etc., while the CP can be configured to perform extensible functions using exemplary CSUB code such as exact match, longest prefix match (LPM), search, etc. that are tailored for the specific application of networking in the present embodiment. Similarly, other subroutine functions for other applications, such as a rendering function for a graphics application, can be utilized with the present disclosure. The CP includes a local instruction memory (LIM) for storing the subroutines, each comprised of a sequence of instructions chosen from an instruction set architecture (ISA) having instructions such as hashing, mask-plus-count (MPC), set-assign-compare (SAC), error detection and correction (EDC), etc. The ISA contains the building blocks of executable instructions from which third parties can create novel, efficient, and differentiated algorithms for a given application. These algorithms are loaded into the LIM of the CP for execution on the PEs. The instructions, subroutines, and overall computing power of the MMCC are secondary to a more powerful set of instructions, subroutines, and computing power of a host, to which the chip is slaved.
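
Purely as an illustration of composing a CSUB from such ISA building blocks (the opcode names track the instructions listed above, but the operand model and encoding below are hypothetical), an LPM subroutine might be laid out as:

    # Hypothetical CSUB layout: each entry pairs an ISA building-block opcode
    # with illustrative operands; the chip's actual instruction encoding is
    # not specified here.
    LPM_CSUB = [
        ("HASH", {"rounds": 2}),         # hash the lookup key
        ("LOAD", {"src": "MM"}),         # fetch a table node from main memory
        ("MPC",  {"mask": 0xFFFFFF00}),  # mask-plus-count on the prefix bits
        ("SAC",  {"cmp": "prefix"}),     # set-assign-compare to pick a match
        ("EDC",  {}),                    # error detection and correction
        ("RET",  {}),                    # return the result toward the ROB
    ]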

The present disclosure's construction of a monolithic memory-with-coprocessor chip having higher-level abstract operations to access and modify the data in the structure makes it possible to better isolate and control the integrity of the data while providing higher throughput. The isolation is similar or analogous to the isolation provided by modern object-oriented languages like C++, Python, etc., whereby data structures can be created with their own access methods. The coprocessor arrangement allows the device to have a slave-style communication that provides a well-ordered handoff of control at the hardware level. This solution overcomes deadlock limitations that exist in alternative computing methods such as complex process signaling, mutual exclusion protocols, or multi-processor modified-exclusive-shared-invalid (MESI) protocols, which have been developed to facilitate data sharing. These generalized data manipulation solutions do not take into account the specific structure of the data and thus miss the opportunity to take advantage of optimizations inherent in the data structure.

The input interface to the MMCC communicates an external CMD received at its input interface to: i) the MM and the CP in parallel, so that either can immediately start processing the CMD if they recognize it; and ii) a re-order buffer (ROB) that mirrors the sequence of incoming CMDs in order to effectuate a first-in-first-out (FIFO) protocol for the chip, for both memory accesses and subroutine calls. An input buffer partitioned per the SDTs will store the CMDs and the incoming data associated with each CMD according to the SDT. The input interface and the output interface interleavedly process access CMDs to MM and CSUB CMDs to the CP.

In another embodiment, the MM includes an on-chip memory scheduler (MS) that resolves all bank conflicts locally, i.e., without requiring a host to consume its valuable processing bandwidth managing mundane bank conflicts. The external access calls to the MMCC are queued per the SDT assigned to them by the host, while internal access calls from the CP are queued in a separate buffer per the SDT of their initial CMD received at the MMCC. A FIFO protocol is implemented at any granularity and priority desired, such as globally across all internal and external accesses, external accesses across all SDTs, external accesses of individual SDTs with round robin across multiple SDTs, internal accesses given highest priority or weighted versus external accesses, etc.
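
One such policy is sketched below in Python under assumed weights: a 2:1 preference for internal CP accesses over external accesses, with round robin across the external SDT queues. The weighting is a user configuration choice, not a fixed behavior of the chip:

    from collections import deque
    import itertools

    def schedule(internal_q, external_qs, weight=2):
        """Yield memory-access CMDs, favoring internal accesses 'weight':1."""
        rr = itertools.cycle(range(len(external_qs)))
        while internal_q or any(external_qs):
            for _ in range(weight):              # up to 'weight' internal CMDs
                if internal_q:
                    yield internal_q.popleft()
            for _ in range(len(external_qs)):    # then one external, round robin
                q = external_qs[next(rr)]
                if q:
                    yield q.popleft()
                    break

For example, schedule(deque(["i1", "i2"]), [deque(["a1"]), deque(["b1"])]) drains every queue while issuing internal CMDs roughly twice as often as external ones.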

The array of PEs can be flat or can be hierarchically organized into PE clusters (PECs), with one embodiment having four PEs per cluster and eight clusters per chip for a total of 32 PEs, though any hierarchy and quantity can be used. Each PE includes a local instruction memory (LIM) that is slaved to that PE for minimal latency. The CSUB code architecture supports branches and jumps but does not support interrupts or cache-control instructions. This is because the LIM is required to have all the CSUB code that it supports (either a full set or a subset) loaded therein. Thus, the LIM does not perform cache fetches to retrieve instructions stored in an off-chip memory. By focusing on an application-specific subset of subroutines, i.e., memory-centric network functions for the present embodiment, the CP benefits from fast, uninterrupted, and guaranteed processing of data at the expense of a limited CSUB code size. A CSUB code superset that is sufficiently small can be implemented by loading the entire superset in all the PEs' LIMs. A large CSUB code superset that exceeds the size restriction of the LIM can be split into multiple CSUB code subsets that are assigned to multiple different PEs, which when taken together fulfill the CSUB code superset. Similar to the LIM, a local data memory (LDM) is slaved to each PEC, or alternatively to each PE, in order to provide the lowest latency for frequently used master data stored only in the LDM and not in the MM and not in off-chip memory. An access by a PE to the LDM within its PEC is the fastest access, with an access to an LDM outside its PEC being second fastest, and an access to the MM being the slowest on-chip access of data memory. Because all data processed by the PEs are on-chip, the overhead and complexity of cache fills and spills are not needed to retrieve/send data from/to off-chip memory. Furthermore, because there is no dependence upon the statistical performance characteristics of a cache, performance is very predictable.
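
A minimal sketch of that superset split, assuming a hypothetical greedy first-fit packing helper and LIM capacities measured in instructions:

    def assign_subsets(csubs, lim_capacity, num_pes):
        """csubs: dict of CSUB name -> instruction count. First-fit packing."""
        bins = [{"free": lim_capacity, "csubs": []} for _ in range(num_pes)]
        for name, size in sorted(csubs.items(), key=lambda kv: -kv[1]):
            if size > lim_capacity:
                raise ValueError(f"{name} cannot fit in any LIM")
            for b in bins:
                if b["free"] >= size:            # first LIM with enough room
                    b["csubs"].append(name)
                    b["free"] -= size
                    break
            else:
                raise ValueError("superset does not fit across the PEs")
        return bins

For instance, assign_subsets({"LPM": 600, "EXACT": 300, "SEARCH": 200}, lim_capacity=1024, num_pes=2) packs LPM and EXACT into the first LIM and SEARCH into the second; together the two PEs fulfill the superset.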

Each PE will execute CSUBs assigned to it from one or more external CMDs. Upon receipt of a plurality of CSUB CMDs at the input interface of the MMCC, an aggregator in the CP will classify the CSUB CMDs according to a type of the CSUB CMD, e.g., the opcode. The rules for the type classification of a CSUB CMD are defined by the user/host, thus giving them flexibility in optimizing the PEs on the chip for the user's specific application. A creative user can implement special coding of CSUB CMDs, etc., to provide for high-priority urgent tasks vs. low-priority bulk tasks, and any desired synchronizing between the two. Once a CSUB CMD has been classified according to its CSUB type, a single allocator will allocate the CSUB CMD to a PE assigned to that type of CSUB CMD, when a thread of an eligible PE becomes available. Note that a PE can perform all functions if it has the entire instruction set loaded in its LIM. If the entire CSUB code superset cannot be loaded into a PE's LIM, then the aggregator will map which PEs can perform which CSUB CMD opcodes, and will assign the CSUB CMD to the appropriate PE with that CSUB code subset. The PE is limited primarily by the user-defined allocation of the PE to a given type of CSUB CMD, or by the CSUB code subset loaded into a given PE, as desired and configured by a user. For example, a user can define a high-priority LPM as one CSUB opcode, and define a low-priority LPM as a different CSUB opcode. The user can then either assign more PE resources to the high-priority CSUB opcode, or simply call the high-priority CSUB code less frequently, with either choice or both together effectively providing a high resource-to-usage ratio.
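
The user-defined classification rules might be modeled as a simple opcode-to-type map; the opcodes and type names below are illustrative assumptions, not chip-defined values:

    from collections import deque

    TYPE_RULES = {                 # user-defined: CSUB opcode -> type
        "LPM":        "bulk",
        "LPM_URGENT": "urgent",    # same algorithm, separate high-priority type
        "EXACT":      "match",
        "SEARCH":     "match",     # interleaved FIFO with EXACT in one queue
    }
    queues = {t: deque() for t in ("bulk", "urgent", "match", "other")}

    def classify(cmd):
        """Aggregator step: enqueue a CSUB CMD by its user-defined type."""
        queues[TYPE_RULES.get(cmd["opcode"], "other")].append(cmd)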

The CP is reconfigurable on the fly during field operation of the MMCC. This is implemented by the allocator: ceasing to accept subsequent given CMDs for a given PE; rerouting incoming given CMDs to a secondary source PE with the same CMD capability; emptying the given PE's queue by allowing the given PE to process its existing threads to completion; performing an overwrite operation on the LIM of the given PE with updated instructions received from a host; writing the new CSUB opcode into a map table of the allocator; and finally, starting to accept CMDs for the updated or new CSUB.
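
The six steps can be sketched in software as follows; the dictionary fields are stand-ins assumed for illustration, not the hardware's actual state registers:

    def reconfigure_pe(allocator, pe, new_lim_image, new_opcode):
        """Drain-and-reload sequence for one PE (all field names assumed)."""
        allocator["accepting"].discard(pe["id"])       # 1. cease new CMDs for this PE
        allocator["reroute"][pe["id"]] = pe["peer"]    # 2. reroute to a capable peer PE
        while pe["busy_threads"] > 0:                  # 3. let existing threads finish
            pe["busy_threads"] -= 1                    #    (stand-in for hardware drain)
        pe["lim"] = list(new_lim_image)                # 4. overwrite the LIM
        allocator["map_table"][new_opcode] = pe["id"]  # 5. register the new CSUB opcode
        allocator["accepting"].add(pe["id"])           # 6. resume accepting CMDs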

The ROB includes a command queue output buffer (CQOB) and a data output buffer (DOB) that are partitioned per the SD and are tied to each other per the SDT. If an external CMD to access MM or to execute a CSUB on the CP requires output data, then the CMD is written into the CQOB of the ROB, and a respective portion of the DOB is reserved and tied to the given CMD in the CQOB. If the size of the respective portion of the DOB needed for a given CMD output is unknown, then writing output data to the DOB from subsequent CMDs is stalled, even if those subsequent CMDs know the size of DOB needed for their output data. This is done to guarantee sufficient memory in the DOB for the given CMD in order to preserve the FIFO protocol.
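
A minimal sketch of that FIFO-preserving reservation rule, with the DOB and CMD records modeled as plain dictionaries for illustration:

    def try_reserve(dob, cmd_queue):
        """Reserve DOB space in CMD order; stall behind an unknown output size."""
        for cmd in cmd_queue:                  # cmd_queue is in FIFO order
            if cmd["out_size"] is None:        # size not yet deterministic
                return False                   # later CMDs must not write the DOB
            if not cmd.get("reserved"):
                if dob["free"] < cmd["out_size"]:
                    return False               # wait for space, still in FIFO order
                dob["free"] -= cmd["out_size"]
                cmd["reserved"] = True
        return True

The early return on an unknown size is the point of the rule: a later CMD with a known size is still held back so it cannot consume DOB space that the earlier CMD may need.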

In a system environment, a single MMCC can be slaved to a plurality of PFEs without requiring the PFEs to schedule or coordinate between themselves for the MM or CP resources on the MMCC. Each of a plurality of I/O port pairs is uniquely coupled to each of the plurality of PFEs. Then CMDs from each PFE are processed according to its port pair, whose integrity and tracking are maintained inside the chip by associating the CMD and data received from each PFE according to their SDT, which is globally assigned considering all ports on the chip and is thus unique within the chip.

The MM portion of die area in the present embodiment is approximately 66%, with approximately 33% of the MM portion dedicated to the memory cell array itself. The resulting net MM portion of die area comprising purely memory cells is approximately 15-25% or more of the die area.

Another embodiment of the MMCC is a multi-chip module (MCM) comprising: i) the MMCC described above; ii) a high-bandwidth memory (HBM) chip for expanded memory capacity; and/or iii) a FLASH non-volatile memory (NVM) chip for permanent storage of data such as subroutine instructions. Alternatively, multiple MMCCs can be stacked together with a common bus running through the depth of the chip to provide through-silicon vias (TSVs) for expanding the memory capacity or CP processing capability.

The MMCC is a heterogeneous combination of various types of memory and logic on a monolithic device, including embedded DRAM (eDRAM), SRAM, eFUSE, and high-speed logic. When considering an MCM, the heterogeneous memory combination is expanded to include FLASH NVM.

The present disclosure provides a chip with a degree of extensibility and programmability that allows the device to address multiple markets, thus amortizing the product development and support cost.

The methods, operations, processes, systems, and apparatuses disclosed herein may be implemented in any means for achieving various aspects, and may be executed in a form of a machine-readable medium, and/or a machine-accessible medium, embodying a set of instructions that, when executed by a machine or a data processing system (e.g., a computer system), in one or more different sequences, cause the machine to perform any of the operations disclosed herein. Other features will be apparent from the accompanying drawings and from the detailed description that follows. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

BRIEF DESCRIPTION OF THE VIEWS OF THE DRAWINGS

Example embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:

FIG. 1 is a functional block diagram of a line card in a network system, according to one or more embodiments.

FIG. 2 is a functional block diagram of a multi-chip module (MCM) comprising an integrated main memory and coprocessor chip (MMCC) with low latency, a flash memory chip, and a high-bandwidth memory (HBM) chip, according to one or more embodiments.

FIG. 3 is a functional block diagram of the memory controller (MC) and memory scheduler (MS) for main memory, according to one or more embodiments.

FIG. 4 is a functional block diagram of a main memory (MM) portion of the MMCC, according to one or more embodiments.

FIG. 5A is a functional block diagram of a programmable engine cluster (PEC), according to one or more embodiments.

FIG. 5B is a functional block diagram of an individual programmable engine (PE), according to one or more embodiments.

FIG. 6 is a functional block diagram of logic function blocks of the PE, according to one or more embodiments.

FIG. 7 is a functional block diagram of a reorder buffer (ROB) for maintaining a FIFO sequence across the I/O scheduling domain, according to one or more embodiments.

FIG. 8A is a flowchart of a method for operating an IC with an I/O scheduling domain in the MMCC, according to one or more embodiments.

FIG. 8B is a flowchart of a method for operating a multi-threaded coprocessor (CP) in the MMCC, according to one or more embodiments.

FIG. 8C is a flowchart of a method for reconfiguring a PE with a new instruction set during field operation, according to one or more embodiments.

FIG. 9 is a case table illustrating an I/O scheduling domain and the PE multi-threaded domain for a single port, according to one or more embodiments.

FIG. 10 is a flow-path illustration of multiple commands concurrently executing on the MMCC to both access MM and call subroutines in the CP, according to one or more embodiments.

FIG. 11 is a layout diagram illustrating the placement and size of MM compared to the CP, according to one or more embodiments.

Other features of the present embodiments will be apparent from the accompanying drawings and from the detailed description that follows.

DETAILED DESCRIPTION

A method, apparatus, and system for an integrated main memory (MM) and configurable coprocessor (CP) chip for processing a subset of network functions are disclosed. In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the various embodiments. It will be evident, however, to one skilled in the art that the various embodiments may be practiced without these specific details.

LIST OF ACRONYMS USED IN DESCRIPTION (PLURAL ADDS LOWERCASE S, ES)

CMD  command
CP  coprocessor
CQOB  command queue output buffer
CSUB  subroutine call
DOB  data output buffer
EDC  error detection and correction
FIFO  first-in-first-out
I/O  input/output
IC  integrated circuit
IM  instruction memory
ISA  instruction set architecture
LDM  local data memory
LIM  local instruction memory
MAAC  media access controller
MAC  memory access controller
MM  main memory
MMCC  MM and CP chip
MPC  mask plus count
MS  memory scheduler
NPU  network processing unit
PE  processing engine
PEC  processing engine cluster
PFE  packet forwarding engine
ROB  reorder buffer
SAC  set assign compare
SD  scheduling domain
SDT  scheduling domain thread
uP  microprocessor

Functional Block Diagram

Referring now to FIG. 1, a functional block diagram is shown of a line card 100 in a network system, according to one or more embodiments. The line card 100 includes a packet forwarding engine (PFE) 102-1 and an optional processor 102-2 coupled in parallel to manage different portions of the network traffic. Optional processor 102-2 can be a network processing unit, a special function processor, or a co-processor. The PFE 102-1 and optional processor 102-2 process network packets, e.g., Internet packets, for routing, security, and other management functions. This is a task that consumes a substantial amount of processing bandwidth to accommodate high traffic rates of packets. The PFE 102-1 and optional processor 102-2 can each be a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), or an application specific standard product (ASSP), etc., that operates on all types of private and public networks, such as a LAN, WAN, SAN, VPN, etc., within a company and out to the public Internet.

The PFE 102-1 includes a microprocessor (uP) 104 coupled to a memory cache block 106 of random access memory (RAM) for storing instructions or data temporarily on the die of the PFE 102-1, for quicker access than off-chip memory storage, i.e., DRAM 113. Scheduler 108 manages access calls to DRAM 113 to avoid conflicts while accessing DRAM 113, e.g., simultaneously accessing a same memory bank, per rules established by the DRAM designer. The scheduler 108 adds latency to the packet processing functions of PFE 102-1 by requiring PFE 102-1 to generate access fetches to off-chip memory, including the resolution of conflicts therein.

The media access controller (MAAC) and framer 120 processes network packets coming into the line card 100 to ensure proper packaging of control and data portions of the packet. The PFE 102-1 and optional processor 102-2 then perform the network management functions on the network packet, followed by a traffic manager (TM) block 124, which regulates the output of packets from the line card to match the network capabilities.

Commodity DRAM 113 is utilized liberally in the line card 100 for packet buffering purposes. For example, when different blocks in the pipeline reach their capacity and stop accepting packets from an upstream block, upstream packets are frequently buffered by off-chip DRAM 113. Moving data back and forth from DRAM 113 is illustrated as paths AA, BB, CC, and EE. Data is moved from dashed memory locations 123-A, -B, -C, and -D in DRAM 113 to memory locations 123-A′, B′/C′, and D′ (prime) in the functional blocks MAAC/framer 120, PFE 102-1, and TM 124, respectively. A substantial amount of power is consumed moving data back and forth from DRAM. Consequently, any reduction in caching or buffering will help reduce line card power demand.

One DRAM 113 is slated for storing control data 123-C in a table format to be communicated back and forth to PFE 102-1 via link CC, and to store cache versions of this control data, shown as dashed block 123-C′ (prime), in cache memory block 106 of PFE 102-1. While the DRAM 113 storage of table data 123-C is more sophisticated than that of the balance of the DRAMs 113 that simply buffer packets, having to move any data back and forth between DRAM 113 and PFE 102-1 still potentially adds latency to the pipeline. Specifically, the latency arises by requiring PFE 102-1 to schedule access calls, by requiring DRAM 113 to read the data 123-C, by requiring cache 106 to store data 123-C′, and by requiring uP 104 and/or scheduler 108 to resolve any conflicts in the memory fetch from DRAM 113 and to resolve any coherency issues between the two versions of data 123-C and 123-C′.

The main memory/coprocessor (MMCC) chip 200, a monolithic device, includes a scheduler 310 coupled to a processor engine (PE) array 500, also referred to as a PE complex, and to a large block of main memory 400. PE array 500 provides processing resources to perform a set of CSUB code and functions on data 122-1 and 122-2 stored in MM 400. By performing the set of subroutines and functions locally on data stored in its main memory 400, the MMCC 200 will: i) eliminate transit time and reduce power consumption otherwise required to send the data back to the processors 102-1 and 102-2; and ii) increase uP 104 bandwidth for other networking tasks by not requiring it to perform subroutines that the PE array 500 can perform.

Data blocks 122-1 and 122-2 in MMCC 200 are not dashed in the illustration because they are data solely stored in MM 400 as the master version of a given type or range of data. In comparison, DRAM 113 stores data temporarily, which is illustrated as dashed blocks of data 123-A, -B, -C, and -D. While PFE 102-1 and optional processor 102-2 can access data in MM 400 for specific purposes, they do not transfer large chunks of data back and forth between themselves and MM 400, except for populating MM 400 at initialization of MMCC 200 or line card 100. Thus, MMCC 200 eliminates power otherwise required for transferring large blocks of data back and forth to processor(s) 102-1 and 102-2. Additionally, MMCC 200 eliminates coherency problems that would otherwise arise from having multiple versions of data disposed on separate chips.

Additionally, the two exemplary instances of data 122-1 and 122-2 on a single MMCC chip 200 can be managed by MMCC 200 for two separate users, i.e., processors 102-1 and 102-2, respectively. This sharing of resources, from both MM 400 and PE array 500 resources on MMCC 200 to multiple processors 102-1 and 102-2, is performed seamlessly and transparently without requiring the multiple processors 102-1 and 102-2 to coordinate between themselves to avoid conflicts while accessing said shared resources. This is accomplished by slaving the MMCC 200 to the two processors via different ports. Namely, MMCC 200 is slaved to PFE 102-1 via port A with I/O serial lanes DD and is slaved to optional processor 102-2 via port B with I/O serial lanes DD′. The task of tracking commands and data from the multiple processors 102-1 and 102-2 is performed by MMCC 200 via tagging the data and commands with a scheduling domain thread, as described in subsequent figures and flowcharts.

As an example, PFE 102-1 can issue a string of access commands to MM 400, including an optional memory partition location of data, without having to spend uP 104 bandwidth resolving any possible bank conflicts in MM 400. Additionally, PFE 102-1 can interleave the string of access commands with a plurality of network-related subroutine calls to PE array 500, such as a longest prefix match (LPM) on IP addresses. In parallel with these commands from PFE 102-1 to MMCC 200, the optional processor 102-2 can also be communicating access commands and subroutine commands to MM 400 and PE array 500 of MMCC 200, without coordinating those commands with the first processor PFE 102-1. Thus, MMCC 200 provides an efficient solution to reducing the high processing demands on the PFE 102-1, while reducing latency of the pipelined processing of data packets on line card 100, and reducing power and latency otherwise required by transferring data back and forth to the cache 106 of PFE 102-1.

While the quantity of ports on MMCC 200 in the present embodiment is two (ports A and B), any quantity of ports can be used, with the quantity of ports equal to the quantity of external processors that MMCC 200 can support independently. Thus, a two-port MMCC 200 with eight total SDs can independently support two external processors evenly with four SDs per external processor or port. The quantity of SDs can be scaled to any quantity of ports for a different MMCC design. For example, an MMCC with four ports, not shown, and 12 scheduling domains could be linked to four separate external processor chips, with three scheduling domains per external processor.
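
The per-port arithmetic is simply an even split of the global SD pool, as this small check of the two examples above illustrates:

    def sds_per_port(total_sds, num_ports):
        """Global scheduling domains split evenly across the port pairs."""
        assert total_sds % num_ports == 0, "SDs must divide evenly among ports"
        return total_sds // num_ports

    assert sds_per_port(8, 2) == 4    # present embodiment: two ports, eight SDs
    assert sds_per_port(12, 4) == 3   # the four-port example design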

Referring now to FIG. 2, a functional block diagram is shown of a multi-chip module (MCM) 201 comprising an integrated main memory and coprocessor chip (MMCC) 200 with low latency, a non-volatile memory (NVM) chip 240, and a high-bandwidth memory (HBM) chip 280, according to one or more embodiments.

The MMCC 200 includes two ports, A and B, shown as IN-A and IN-B, with serial lanes A1-An and B1-Bn, where n can be any quantity of lanes, but is eight in the present embodiment. Input ports IN-A and IN-B are coupled to SerDes input interfaces 204A and 204B that are in turn coupled to the physical coding sublayer (PCS)/framer (FM) blocks 206A and 206B, respectively. Outputs from PCS/FM 206A, 206B are communicated via lines that communicate payload A and payload B from port A and port B, respectively, into memory controller 300 and PE array 500, where f is the number of scheduling domains per port and is a global value. In the present embodiment, f=4 scheduling domains (SD) per port, with SD 1-4 assigned to port A, and SD 5-8 assigned to port B. The memory controller 300 will decode the respective scheduling domains associated with each of the CMDs.

Memory controller 300 includes a plurality of partition controllers 302-1 to 302-p, where p is any quantity of partitions and associated partition controllers, slated one per memory partition in the present embodiment. MM 400 is comprised of a plurality of partitioned memory blocks 406-1 to 406-p, where p is any value, but p=4 for the present embodiment. PE array 500 is comprised of a plurality of PECs 502-1 to 502-g, where g is any number as required for computational needs, but g=8 for the present embodiment.

The memory controller 300 and PE array 500 are coupled in parallel to the input interface, namely PCS/FM 206A, 206B, in order to receive, in parallel, CMDs arriving on input ports A and B. Additionally, reserve line 211 (RSV) communicates the sequence of CMDs and their respective SDs received at the memory controller 300 to the reorder buffer (ROB) 700 output ports OUT-A, OUT-B, to ensure a first-in-first-out (FIFO) processing of data into and out of MMCC 200. The PE array 500 ignores memory access CMDs and processes only CSUB CMDs arriving on lines 215-1 to 215-c, where c is any bus width. Memory controller 300 ignores subroutine CMDs and processes only memory access CMDs. The memory controller 300 is coupled to both optional HBM 280 and MM 400 in parallel in order to control accesses to both. In particular, memory controller 300 is coupled to MM 400, via lines 213-1 to 213-a to each memory partition 406-1 to 406-p, where a is any bus width considering the number of access requests, and a=8 in the present embodiment for 4 reads and 4 writes. PE array 500 is coupled via lines 217-1 to 217-r to communicate memory access requests from PE array 500 to memory controller 300 while bypassing the input interface 204-A to -B. This direct access by MMCC 200 saves up to three orders of magnitude of the cycles required for accessing data in MM 400. As compared to a PFE 102-1 of FIG. 1 requesting multiple iterations of data from DRAM 113, the on-die PE array 500 can make the same iterative fetches of data from on-die MM 400, perform the subroutine functions on that data, and save the output data back to memory, with the latency savings being multiplied by the number of iterations required for a given CMD.

Outputs from MM 400 destined for PE array 500 proceed directly out of MM 400 via lines 223-1 to 223-v to memory controller 300 and then into PE array 500, as further described in FIGS. 5A-5B. Thus, additional latency savings are realized by this direct routing between MM 400, memory controller 300, and PE array 500. Outputs from MM 400 via lines 219-1 to 219-m, where m is any bus width, and outputs from PE array 500 via lines 221-1 to 221-k, where k is any bus width, are coupled to results mux 230 in parallel. The lines 219-1 to 219-m first proceed to memory controller 300 for processing prior to being routed to results mux 230. Results mux 230 in turn selectively communicates data to a reorder buffer (ROB) 700 via lines SD1-SD(f) and SD(f+1)-SD(2f), according to the scheduling domain associated with the data output from MM 400 and PE array 500. ROB 700 includes an output command queue 708 and a data output buffer (DOB) 720 coupled thereto and partitioned per SD. Output ports OUT-A and OUT-B are paired with input ports IN-A and IN-B, respectively. Similarly, the output interfaces of SerDes 224A and 224B are coupled to PCS 226A and 226B in a mirror image of the input interface. Output lines A1-Ah and B1-Bj communicate the output results back to the user.

Serial interface 205 is a diagnostic port using any one of a number of slow-speed serial interface standards, such as SMBus, I2C, JTAG, etc., for writing and reading specific registers on the MMCC 200 for diagnostic purposes. A debug microcontroller (uC) 207 is coupled to serial interface 205 for receiving commands and returning data. Debug uC 207 can communicate with other blocks on MMCC 200, such as MM 400 and PE array 500.

Overall, the modular architecture of the MMCC 200 provides a plurality of parallel flow paths through MMCC 200, both through the MM 400 and the PE array 500, such that no one path is a choke point for data through the chip. The modular architecture also provides for future scalability of the chip for greater throughput and data processing.

The NVM chip 240 stores program instructions for subroutines on which the PE array 500 executes CMDs. Instructions from the NVM chip 240 are loaded into instruction memory at initialization. Program instructions can be updated in the field when the MMCC 200 is off-line. Alternatively, program instructions can be updated to NVM 240 and implemented in MMCC 200 during field operation while MMCC 200 is operational with PFE 102-1. This is possible because of the modular architecture of the MMCC 200, as will be described in subsequent figures.

Optional HBM 280 is coupled via expansion (XP) bus 281 to memory controller 300 and to reorder buffer 700, in parallel with MM 400, in order to expand the on-die memory capacity of MM 400. This will allow extended table sizes and accommodate future increasing memory storage needs.

Referring now to FIG. 3, a functional block diagram 301 is shown of the memory controller (MC) 300 and memory scheduler (MS) (scheduler) 310-1 for main memory, according to one or more embodiments. Input interface details are expanded beyond prior FIG. 2 to indicate that the PCS/FM 206A, 206B contain decoders 207A and 207B coupled to muxes 210A and 210B, respectively, to decode the transport protocol of the incoming data and frame the data appropriately for on-chip processing. Lines out of mux 210A provide payload A from port A (from PFE 102-1), while lines out of mux 210B provide payload B from port B (from optional processor 102-2) of FIG. 1, which are both communicated in parallel to memory controller 300 and PE array 500.

The memory controller 300 includes a plurality of partition controllers (PTC) 302-1 to 302-p, wherein p=4 for the current embodiment, to be equal to the quantity of partitions in MM 400. The components shown in PTC 302-1 are replicated in all partitions, e.g., 302-1 through 302-p. CMDs and their associated data are output from the 210A and 210B muxes and sorted into input queues 308 (represented by the box icon therein), and specifically into buffers 308A through 308B, for port A and port B respectively, with specific scheduling domains 1-4 shown as SD-1 through SD-f. The value of f can be any quantity of scheduling domains, with the current embodiment using f=4 for port A, and scheduling domains 5-8 shown as SD-(f+1) through SD-2f for port B. Thus, each PTC has its own input queues 308 for both ports A and B. Scheduler 310-1 is coupled to the input queues 308, as well as to memory access CMD queue 309, which is generated by the PE array 500 queues; an optional debug CMD queue (not shown) can also be scheduled by scheduler 310-1. In particular, scheduler 310-1 selects a memory access CMD, and associated data for writes, to be processed per a load-balancing schedule. The load balancing performed by scheduler 310-1 can be weighted, e.g., to favor memory accesses from PE array 500, such as twice as frequently as external memory access CMDs in the SD input queues 308. Alternatively, the scheduler 310-1 can load balance evenly using a round robin technique, or can utilize a randomized input to select from which buffer 308A or 308B the next memory access CMD will be taken, or can pick the oldest queued CMD to have the highest priority. When a given memory access CMD contains a conflict that violates the memory usage rules, as specified by the memory designer, then arbitrator block 312-1 resolves the conflict by stalling one of the conflicting memory accesses and choosing the other CMD to proceed. For example, if a given CMD wants to perform 4 reads (Rs) from and 4 writes (Ws) to a given memory partition, and two of the reads are from the same memory bank within the memory partition, which is a known violation, then the scheduler will buffer one of the reads to that same memory bank and save it for a subsequent read cycle that can accommodate the read request and that doesn't have a conflict with the memory bank in question.
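
A minimal sketch of that same-bank rule follows; the (address, bank) tuples are an assumed stand-in for the decoded accesses the arbitrator actually sees within one partition:

    def arbitrate_reads(reads):
        """reads: list of (addr, bank). Returns (issue_now, deferred)."""
        issue, deferred, banks_in_use = [], [], set()
        for addr, bank in reads:                 # oldest first, preserving priority
            if bank in banks_in_use:
                deferred.append((addr, bank))    # retry on a later access slot
            else:
                banks_in_use.add(bank)
                issue.append((addr, bank))
        return issue, deferred

For example, arbitrate_reads([(0x10, 3), (0x20, 3), (0x30, 5)]) issues the first and third reads and defers the second, which targets bank 3 again.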

For a write command, typical logic is shown as write buffer 314-1, which can proceed directly to the output (O/P) mux 320-1 for a straight write, or can be muxed with read (RD) data via RMW mux 316-1 to have the ‘modify’ instruction of the RMW executed by ALU 318-1. Accesses that are output from PTC 302-1 are shown as lines 213-1 to 213-a, which can be either read control, write control and write data, or RMW control and data. Outputs 219-1 to 219-m from MM 400 can be returned back as read data for a RMW operation. The MM 400 is shown with multiple partitions 406-1 through 406-p, which are described in more detail in the following figure.

Referring now to FIG. 4, a functional block diagram is shown of a main memory (MM) 400 portion of the MMCC, according to one or more embodiments. MM 400 is shown here in more detail than in prior FIG. 3. MM 400 is comprised of a plurality of memory partitions 406-1 to 406-p, where p is any number, but p=4 for the present embodiment. Each partition 406-1 through 406-p is comprised of a plurality of memory banks 410-1 through 410-b, where b is any quantity, but is 64 in the present embodiment. And each memory bank 410-1 through 410-b comprises: a memory access controller (MAC) 404-1 coupled to redundant remap registers 408-1 for indicating memory addresses that have been replaced with redundant memory; an ALU 450-1 for performing local operations; and a buffer 422-1 for buffering read and write data into and out of the memory modules MOD 0 through MOD N via MUX 423-1, including redundant cells 430-1. Word size is shown as 72 bits, though any size word can be utilized. Each partition 406-1 through 406-p has four write (W) lines and four read (R) lines in the present embodiment that can be accessed simultaneously for a given partition, so long as the access rules are met, e.g., none of the Ws and Rs can go to the same bank, though any quantity can be designed in with sufficient control logic. More detail on the memory partitions and banks is disclosed in: i) U.S. Pat. No. 8,539,196, issued Sep. 17, 2013, entitled “HIERARCHICAL ORGANIZATION OF LARGE MEMORY BLOCKS”; and ii) U.S. patent application Ser. No. 12/697,141, filed Jan. 29, 2010, entitled “HIGH UTILIZATION MULTI-PARTITIONED SERIAL MEMORY”, both of which are commonly assigned with the present application and both of which are hereby incorporated by reference in their entirety.

Referring now to FIG. 5A, a functional block diagram is shown of a programmable engine cluster (PEC), according to one or more embodiments. PE array 500 is comprised of a plurality of PECs 502-1 through 502-g, each having a plurality of PEs 550-1 through 550-t, with t and g being any values, but g=8 and t=4 in the present embodiment, for a total of 32 PEs per PE array 500. Details of the individual PEs 550-1 through 550-t are provided in the next figure.

The PE array 500 also includes logic block aggregators 510A and 510B, one for each of ports A and B, respectively, coupled to allocator 520, whose function is to allocate CSUB CMDs from aggregators 510A and 510B to one of the PEs 550-1 to 550-t from the pool of PEs in PECs 502-1 through 502-g. Each aggregator 510A, 510B includes a command queue (Q) 512A for buffering a CMD in a specific partition according to the type of the CSUB in the CMD that was received on lines 215-1 to 215-c. Aggregator 510A classifies the CMD into a partition of CMD Q 512A according to categories configurably established by the user, i.e., the PFE 102-1, at initialization. Thus, in one example embodiment, all CMDs can be classified into one of the following four types, and respective queues in CMD Q 512A: i) all CMDs for an LPM CSUB are classified as a first type in a first queue; ii) all EXACT MATCH and SEARCH CSUBs are classified together as a second type, and interleaved per a FIFO protocol, in a second queue; iii) any high-priority CMDs for an LPM having a unique CSUB opcode are classified as a third type of CMD in a lightly populated and frequently accessed queue; and iv) all other CSUB CMDs are lumped together as a fourth type in a fourth queue.

Allocator 520 couples aggregators 510A, 510B to PEs 550-1 through 550-t in each of the PECs 502-1 through 502-g, in order to allocate a CSUB CMD to an eligible PE. Load balancing can be implemented in a number of ways. First, by defining the classification system of CSUB CMDs, a natural prioritization occurs by either oversizing or undersizing a classification of CSUB CMD rate for that classification. For example, one network application could have an extremely frequent occurrence of two types of CSUB CMDs, and an infrequent occurrence of all other CMDs. If a user classifies the two types of CSUB CMDs having frequent occurrences as separate types of CSUB CMDs, then it has an effect of load balancing, versus classifying them together as a single type. Another method for the user to effectively configure load balancing of the PE array 500 is to: i) load the entire instruction set into the instruction memory for all PEs so that all PEs are eligible to execute any CSUB CMD, which effectively flattens the PE array 500 into a set of fungible resources (PEs); ii) load instructions for only a subset of the CSUB codes in a number of PEs, either to detune them, or because the instruction set for the entire CSUB exceeds the capacity of the instruction memory for the PE; or iii) arbitrarily assign a quantity of PEs to a given type of CSUB CMD.

Map table 522 of allocator 520 maps the types of CSUB CMDs against the IDs of PEs that are eligible to execute those types of CSUB CMDs, and the status of the multiple threads of those PEs. Whenever a thread resource of a given PE is available, and the PE eligibility matches a CSUB CMD waiting in the CMD Q 512A, then the allocator allocates the CSUB to the eligible PE on a FIFO basis, where the first CMD in the queue for that type of CMD will be allocated. By providing this configurability, a user has control over where the resources are allocated. The user can also update the allocation depending on field performance and network demands by reinitializing MMCC 200 with an updated map table. The redundancy of the PE resources provides a backup via alternate processing paths if a given PE stalls or fails, provided that another PE is assigned to the same type of CSUB CMD. A CSUB call for a memory access from PE array 500 is individually and independently communicated on lines 217-1 through 217-r to the memory controller 300 of FIG. 3, with results also being independently returned to PE array 500 directly via lines 223-1 through 223-v, through memory controller 300, thereby reducing the latency of a memory access by up to three orders of magnitude. Output data from a completed CSUB code is communicated out of PE array 500 via lines 221-1 through 221-k.
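
The map table and thread-status check might be modeled as follows; the PE ids, opcodes, and thread counts are illustrative assumptions only:

    map_table = {                    # CSUB type -> PE ids eligible to run it
        "LPM":        {"pes": [0, 1, 2]},
        "LPM_URGENT": {"pes": [3]},           # dedicated, lightly loaded PE
        "EXACT":      {"pes": [0, 1, 2, 3]},  # full ISA loaded: any PE eligible
    }
    thread_free = {0: 8, 1: 8, 2: 8, 3: 8}    # free threads per PE, 8 max each

    def pick_pe(csub_type):
        """Return the first eligible PE with a free thread; None means the CMD waits."""
        for pe in map_table[csub_type]["pes"]:
            if thread_free[pe] > 0:
                thread_free[pe] -= 1
                return pe
        return None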

Referring now to FIG. 5B, a functional block diagram is shown of an individual programmable engine (PE) 550-1, according to one or more embodiments. The heart of PE 550-1 is the computation engine 560-1, comprised of logic functions 600-1, which are described in a subsequent figure, coupled to a general-purpose register (GP REG) 562-1. Logic functions 600-1 comply with a classic five-stage reduced instruction set computer (RISC) protocol that executes one instruction per cycle. However, computation engine 560-1 is not a general-purpose CPU (GPCPU) because it does not have an operating system (OS), and does not support an ‘interrupt’ or a ‘cache-control’ instruction. Once an instruction is started on the PE, it runs until completion.

The PEC 502-1 is comprised of a plurality of PEs 550-1 to 550-t coupled to each other and to a shared local data memory (LDM) 540-1 that provides faster access to urgent or frequently used data compared to MM 400. The LDM 540-1 offers the fastest access because of its close proximity to the PEs 550-1 to 550-t, and because it is an SRAM memory type, which is faster than the eDRAM memory type of MM 400. The LDM 540-1 is also accessible externally from PEC 502-1 by line 524-1, to other PEs in other PECs 502-1 to 502-g, though the extra distance and logic required for an access external to its given PEC 502-1 result in a slightly longer access time. By disposing memory locally, reduced latencies are accomplished. By sharing the local data memory 540-1 resource via intra-PEC or inter-PEC access, memory resources can be effectively shared to accommodate an intermittently high memory demand in a given PE.

The CSUB CMD is communicated to the PE 550-1 via one or more of lines 215-1 through 215-c. The CSUB CMD points to a starting line of the given CSUB code in instruction memory (IM) 554-1 or in CMD registers (CMD REG) 552-1, which is subsequently decoded by decoder 558-1 and processed by computation engine 560-1. As indicated by the partitions icon in IM 554-1 and CMD REG 552-1, these resources are partitioned to a quantity of processing threads instantiated by a user of the PE array 500. That is, the multi-threaded processing threads of the CP are configurable, heterogeneously through the array. Thus, one or more of the PEs could be configured to operate concurrently with different quantities of threads. For example, a quantity of PEs could be configured with different threading as follows (quantity of PEs/number of threads): 1/8, 5/7, 4/6, 1/5, 19/4, 1/1, 1/0 (not used). This offers a user a wide variation in performance adaptation to a given application. Furthermore, these differently configured PEs could be assigned different types or classes of CSUB CMDs. Thus, short CSUBs could be assigned to run on PEs configured with 8 threads because short CSUBs will finish quicker. Moreover, longer CSUB code can be assigned to PEs configured with only one or two threads, because they need more bandwidth to complete the CSUB. Thus, the bandwidth of the resource is divided equally among the quantity of partitions selected, from one to eight in the present embodiment, as determined by the user and as implemented during initialization of MMCC 200. Memory register 556-1 is similarly partitioned per processing thread to hold data values fetched by computation engine 560-1 via lines 217-1 to 217-r from MM 400 and returned from MM 400 on lines 223-1 through 223-v. Output results from computation engine 560-1 are stored in results register (REG) 570-1 per the processing-thread partition therein, and finally output on lines 221-1 through 221-k.
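
The example configuration can be checked directly. As a small illustration, assuming (per the text) that each thread receives an equal share of its PE's bandwidth:

    # (number of PEs, threads per PE) from the example configuration above
    config = [(1, 8), (5, 7), (4, 6), (1, 5), (19, 4), (1, 1), (1, 0)]
    assert sum(n for n, _ in config) == 32       # matches 32 PEs per PE array

    for n, t in config:
        share = 1.0 / t if t else 0.0            # bandwidth share per thread
        print(f"{n} PE(s) x {t} thread(s): {share:.3f} of a PE's bandwidth each")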

Referring now to FIG. 6, a functional block diagram is shown of logic function blocks 600-1 of the PE, according to one or more embodiments. Instructions read from CMD REG 552-1 or IM 554-1 of FIG. 5B and executed by computation engine 560-1 of PE 550-1 are more specifically executed by logic functions 600-1 of the present figure. Results are returned back to GP REG 562-1 and fed back into logic functions 600-1 as required until the CSUB code is completed, and results are output on lines 221-1 through 221-k. Data required for execution of an instruction can be fetched from MM 400 or LDM 540-1 and input on lines 223-1 through 223-v as a load of data.

Logic function block 600-1 shown in FIG. 6 includes a plurality of different logical functions integrated together for a specific application, and can be any combination of said logical functions. In the present embodiment of network processing, the functions relating to packet processing and address lookup, traffic management, etc., are relevant. Hence, specific functional blocks are microcoded into the MMCC 200 for fast processing and minimal latency. In particular, a hash logic function (F(x)) block 610-1 is shown with a single stage 612 comprising a cross-connect (X-connect) block 614 coupled to an adder block 616. This block is programmably recursive, per the user, for repetitive rounds. More detail is disclosed in PCT Patent Application No. PCT/US14/72870, filed Dec. 30, 2014, entitled “RANDOMIZER CIRCUIT”, which application is commonly assigned with the present application and which is hereby incorporated by reference in its entirety. Another function block 610-2 is an arithmetic logic unit (ALU) 618. An entire library of functional blocks, through functional block 610-f for a new function with logic 690, can be designed into the MMCC 200 in order to provide the scope of co-processing functionality desired by a designer, and as needed by a host processor.
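
As an illustrative software model only (the actual cross-connect wiring and round count are programmable in hardware and are detailed in the referenced application), one round of such a stage permutes the input words and folds them with an adder. The permutation, four-word width, and modulus below are assumptions:

    def hash_rounds(words, rounds, perm=(2, 0, 3, 1), mod=1 << 32):
        """Permute four input words (cross-connect), then fold with an adder."""
        state = list(words)
        for _ in range(rounds):
            state = [state[i] for i in perm]               # cross-connect stage
            state = [(w + state[0]) % mod for w in state]  # adder stage
        return state

For example, hash_rounds([7, 1, 9, 4], rounds=2) deterministically mixes the four words over two recursive rounds.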

Referring now to FIG. 7, a functional block diagram is shown of a reorder buffer (ROB) 700 for maintaining a FIFO sequence across the I/O scheduling domains of MMCC 200, according to one or more embodiments. The ROB 700 comprises a separate ROB 700A for Port A, with inputs A1-Af for SD-1 through SD-(f), and a separate ROB 700B for Port B, with inputs B1-Bf for SD-(f+1) through SD-(2f), where f=4 in the present embodiment for a total of eight scheduling domains.

The ROB 700A for Port A, which mirrors ROB 700B for Port B, comprises a command queue output buffer (CQOB) 708A and a data output buffer (DOB) 720A, each having separate partitioned memory per scheduling domain, 708A SD-1 to 708A SD-(f) and 728A SD-1 to 728A SD-(f), respectively, to store CMDs received from reservation line 211 and to store the results received from results mux 230, shown in FIG. 2, respectively. Thus, outputs from both MM 400 and PE array 500 are received in interleaved fashion and sorted per SD by results mux 230 into the respective buffers associated with the appropriate SD.

For example, in output data buffer (ODB) 728A SD-1, the first-in, first-out entry is the bottom entry, an output for the ‘longest prefix match’ (O/P LPM) CMD, which corresponds to row 3 of FIG. 9, whose data is completed (COMPL.) and thus will transmit out Port A output interface 226A, 224A upon its load-balanced turn, e.g., round robin. The next output data in ODB 728A SD-1 is the ‘read’ access CMD, corresponding to row 2 of FIG. 9, which has also completed loading its data (COMPL.) and is awaiting the prior output, the LPM CMD, to finish transmitting first. The next entry in ODB 728A SD-1 is the ‘exact match’ CMD, which corresponds to row 1 of FIG. 9, and is still pending results data ( . . . ), but whose discrete size of data is known and therefore reserved in ODB 728A SD-1.

In comparison, the ODB 728A for SD-(f) of the present figure shows a ‘read’ CMD entry at the bottom of the queue, corresponding to row 19 of FIG. 9, which has the highest priority to transmit but is still loading data ( . . . ). This delay possibly occurred because the read CMD was stalled due to low priority in the memory scheduler 310-1, or because of a repeating access conflict per arbitrator 312-1 of FIG. 3. Regardless, the ‘exact match’ CMD, which corresponds to row 18 of FIG. 9, and which is second from the bottom of output data buffer 728A and has completed saving output data (COMPL.), is prevented from being transmitted because the FIFO protocol dictates that the data associated with the bottom CMD, the ‘read’, should be the first item transmitted.

Regardless of the wait for SD-(f), other scheduling domain queues, e.g., SD-1, can still transmit output results if the appropriate FIFO CMD data for that scheduling domain has completed loading in output data buffer 728A. Thus, the modularity of the present embodiment might incur some stalls or delays, but all other autonomous portions of MMCC 200 can continue executing, thus providing a higher overall efficiency and duty cycle for the chip as a whole. Further up the queue in ODB 728A for SD-(f), the reservation of output data buffer space for the ‘exact match’, corresponding to row 18 of FIG. 9, has not completed a deterministic calculation yet (UNRESERVED) and thus will stall the saving of output data for the subsequent CMD in the queue, the ‘LPM’, even if its data is ready to write. This rule preserves the FIFO protocol, which could otherwise be thwarted if output data for the LPM CMD consumed all available memory in output data buffer 728A for SD-(f) and prevented the earlier ‘exact match’ CMD from writing its data to output data buffer 728A.
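
The per-SD FIFO discipline described in the preceding paragraphs can be summarized in the following sketch, which assumes hypothetical per-entry 'reserved' and 'complete' flags; it is illustrative only, not the disclosed circuit.

    # Minimal sketch of one scheduling domain's ODB FIFO discipline:
    # results may arrive out of order, but entries transmit strictly
    # head-first, and an UNRESERVED entry stalls writes behind it.
    from collections import deque

    class OdbEntry:
        def __init__(self, cmd):
            self.cmd = cmd
            self.reserved = False   # deterministic size reserved in the ODB
            self.complete = False   # all output data has been saved

    class SchedulingDomainODB:
        def __init__(self):
            self.queue = deque()    # FIFO of OdbEntry, head = oldest CMD

        def may_save_output(self, entry):
            # A CMD may write results only if every older entry is reserved,
            # preserving the FIFO protocol described above.
            for e in self.queue:
                if e is entry:
                    return True
                if not e.reserved:
                    return False
            return False

        def transmit_ready(self):
            # Only the head may transmit, and only once COMPL.
            while self.queue and self.queue[0].complete:
                yield self.queue.popleft().cmd

    sd = SchedulingDomainODB()
    for name in ("read", "exact match", "LPM"):
        sd.queue.append(OdbEntry(name))
    sd.queue[1].reserved = sd.queue[1].complete = True  # out-of-order finish
    print(list(sd.transmit_ready()))      # [] -- head 'read' still loading
    print(sd.may_save_output(sd.queue[2]))  # False: older entry not reserved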

Flowcharts

Referring now to FIG. 8A, a flowchart 800-A is shown of a method for operating an IC with a multi-threaded I/O scheduling domain in the MMCC, according to one or more embodiments. Operation 802 couples the MMCC ports to the host. Each individual port, A and B, can be slaved to a single processor, i.e., PFE 102-1, as shown in FIG. 1, or the ports can be slaved to any quantity of processors, such that each port is slaved to only one processor and not shared by more than one processor. Thus, in one embodiment, port A is slaved to PFE 102-1 while port B is slaved to processor 102-2, as shown in FIG. 1. Because the scheduling domains are tied to the ports, the slaving of the MMCC 200 to a plurality of processors does not require the plurality of processors to schedule the resources of the MMCC 200 themselves, and does not require the processors to manage conflicts with each other. Thus, a single chip can be used in different line card sockets that will slave the chip to a different quantity of processors, without requiring any kind of reconfiguration or parameter setting in the chip.

Operation 804 initializes the MMCC by loading CSUB code program instructions, CSUB CMD classification rules, and a PE assignment schedule from the host, PFE 102-1, to the NVM 240 of the MCM 201 in FIG. 2 and into IM 554-1 as shown in FIG. 5B. Subsequent initializations of MMCC 200 read the program instructions directly from the NVM 240 into IM 554-1. The PE classification rules: i) specify different types of classifications of CSUB CMDs; and ii) assign a quantity of PEs for each of the types of CSUB CMDs. These rules and assignments are stored in MAP table 522 of allocator 520, as shown in FIG. 5A. The choice of rules and assignments is a form of load balancing available to the user for the multiple redundant PE resources in MMCC 200.
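
A minimal sketch of such classification rules and PE assignments follows; the class names, PE indices, and thread counts are hypothetical values chosen for illustration.

    # Minimal sketch of the PE classification rules loaded at operation 804.
    # The class names and PE counts below are hypothetical illustrations.
    map_table = {
        "LPM":         {"pes": [0, 1, 2, 3], "threads_per_pe": 8},
        "EXACT_MATCH": {"pes": [4, 5],       "threads_per_pe": 4},
        "OTHER_CSUB":  {"pes": [6],          "threads_per_pe": 2},
    }

    def classify(csub_cmd: str) -> str:
        # Rule i): map a CSUB CMD to a classification type.
        return "LPM" if csub_cmd.startswith("lpm") else \
               "EXACT_MATCH" if "exact" in csub_cmd else "OTHER_CSUB"

    def eligible_pes(csub_cmd: str):
        # Rule ii): the quantity of PEs assigned to that type.
        return map_table[classify(csub_cmd)]["pes"]

    print(eligible_pes("exact_match key=0xBEEF"))  # -> [4, 5]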

Operation 812 receives memory access commands and CSUB commands from one or more hosts, on a scheduling domain basis, at an input interface 204 of FIGS. 2 and 10. Because the SDs are globally unique in the MMCC, the input ports and the data received on them have a unique identity, as shown in FIG. 2. The host can schedule any memory access CMD or CSUB CMD in any of the scheduling domains. Thus, the host can spread the same command types across multiple scheduling domains and thereby provide a type of load balancing on the requests as well, assuming a round-robin load balancing is performed by memory scheduler 310-1. Alternatively, different types of prioritization can be accomplished by reserving some scheduling domains for higher-priority memory access CMDs requested by the host. This would have the effect of bypassing the loaded queues in other scheduling domains. A user-programmable load balancing instruction to the memory scheduler specifies the sequence and repetition with which to select the queues in the PTC 302-1. Those load balancing instructions can also provide for priority access to a given scheduling domain queue over all other queues whenever the given scheduling domain receives a CMD. As noted in FIG. 5A, different load balancing techniques are available for prioritizing or balancing CSUB CMDs vis-à-vis PEs.
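
One way to picture such a user-programmable sequence-and-repetition instruction is the sketch below, in which the schedule is a hypothetical repeating list of SD queue numbers, so listing an SD twice per round doubles its weight; the encoding is an assumption, not the disclosed instruction format.

    # Minimal sketch of a user-programmable load-balancing sequence over
    # the scheduling domain queues (hypothetical encoding).
    from itertools import cycle

    sd_queues = {1: ["read A"], 2: [], 3: ["write X", "read Y"], 4: ["LPM key"]}

    # SD-3 appears twice per round: double weight versus SD-1, 2, and 4.
    schedule = cycle([1, 3, 2, 3, 4])

    def next_cmd():
        for _ in range(5):               # poll at most one full round
            sd = next(schedule)
            if sd_queues[sd]:
                return sd, sd_queues[sd].pop(0)
        return None                      # all queues empty this round

    print(next_cmd())   # -> (1, 'read A')
    print(next_cmd())   # -> (3, 'write X'), SD-3's extra slot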

Once a CMD is received by the MMCC 200, the CMD is communicated in parallel to a plurality of locations on the chip, as specified in operations 816, 818, and 820. In particular, operation 816 communicates the command via reservation line 211 of FIG. 2 to ROB 700, shown in FIG. 7, if and only if (IFF) the CMD requires an output. This is done to maintain a FIFO processing of commands received at the MMCC 200 on a scheduling domain-by-scheduling domain basis. If a CMD does not require an output, then the CMD is either not communicated to, or is ignored by, ROB 700. In an alternative embodiment, a global FIFO sequencing is maintained by assigning a global sequence value that tracks with the data throughout the execution of the CMD. However, one consequence of this global protocol would be the potential backing up of all subsequent CMDs behind a stalled CMD slated in a single SD.

In operation 818, the CMD is also communicated in parallel to the memory controller 300 in general, and to the memory partition controllers 302-1 to 302-p specifically, as shown in FIG. 3. IFF the CMD is a memory access CMD, then it, along with any associated write data, is sequentially buffered per its SD in the appropriate input queue buffer 308, e.g., one of the 308A buffers from SD-1 through SD-(f). Else, the CMD is disregarded.

In operation 820, the CMD is also communicated in parallel to the aggregator 510A for Port A, or 510B for Port B. IFF the CMD is a CSUB CMD, then it is classified by aggregator 510A in FIG. 5A according to its type of CMD, per the rules received in operation 804. It is then buffered in an appropriate partition in command queue 512A per the CMD type. A wide range of load balancing options and prioritization schemes for the CSUB CMDs are available to a user, according to how the user configurably classifies the CSUB CMDs, as disclosed in FIG. 5A and in the summary section herein.
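
Operations 816, 818, and 820 together amount to a type-filtered parallel dispatch, sketched below with hypothetical dictionary-based stand-ins for the ROB, the partition controller queues, and the aggregator.

    # Minimal sketch of operations 816/818/820: one incoming CMD is offered
    # to the ROB, the memory partition controller, and the aggregator in
    # parallel; each block keeps or ignores it by CMD type.
    def dispatch(cmd, rob, ptc_queues, aggregator):
        # Operation 816: reserve in the ROB IFF the CMD returns an output.
        if cmd["needs_output"]:
            rob.setdefault(cmd["sd"], []).append(cmd["name"])
        # Operation 818: buffer per SD IFF it is a memory access CMD.
        if cmd["kind"] == "mem":
            ptc_queues.setdefault(cmd["sd"], []).append(cmd)
        # Operation 820: classify and queue IFF it is a CSUB CMD.
        elif cmd["kind"] == "csub":
            aggregator.setdefault(cmd["type"], []).append(cmd)

    rob, ptc, agg = {}, {}, {}
    dispatch({"name": "write (no ack)", "kind": "mem", "sd": 1,
              "needs_output": False, "type": None}, rob, ptc, agg)
    dispatch({"name": "exact match", "kind": "csub", "sd": 1,
              "needs_output": True, "type": "EXACT_MATCH"}, rob, ptc, agg)
    print(rob, list(ptc), list(agg))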

Operations 819 and 821 inquire whether the command buffers 308 in PTC 302-1 of FIG. 3 and the command/data buffer 512A of FIG. 5A, respectively, are nearly full. If they are, then MMCC 200 notifies the host per operation 823, which will then restrict issuance of new CMDs. The present embodiment is a notice-based flow control, which imposes lower overhead on the user. The notice can be either a data link layer (DLL) notice or a transaction layer notice using a communication protocol, e.g., the GigaChip interface protocol. One interface protocol used by the MMCC is described in more detail in U.S. Pat. No. 8,370,725, issued Feb. 5, 2013, entitled “COMMUNICATION INTERFACE AND PROTOCOL”, which is commonly assigned with the present application and which is hereby incorporated by reference in its entirety. The transaction layer notice is faster but consumes more overhead. However, the expense of using a transaction model frame for a transaction alert is justified because of its infrequent occurrence and because of the severity of the exceptional condition, such as: queues nearing ‘full’ or nearing ‘empty’; or one or more uncorrectable error conditions, like a multi-bit error detected by EDC. However, the present disclosure is also well suited to an alternative embodiment using a token system, wherein the host manages a given population of tokens for different resources on the MMCC, extracting a token when requesting a resource and crediting a token when a resource has completed a task.
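
The notice-based flow control of operations 819, 821, and 823 reduces, in essence, to a watermark check, sketched below; the buffer depth and nearly-full threshold are assumed values, not disclosed parameters.

    # Minimal sketch of operations 819/821/823: a notice-based flow control
    # in which the MMCC warns the host when an input buffer nears full.
    BUFFER_DEPTH = 64
    NEARLY_FULL = 56          # assumed watermark, not a disclosed value

    def check_buffers(buffers, notify_host):
        for name, buf in buffers.items():
            if len(buf) >= NEARLY_FULL:
                # DLL or transaction-layer notice; host restricts new CMDs.
                notify_host(f"{name} nearing full ({len(buf)}/{BUFFER_DEPTH})")

    check_buffers({"PTC 308A SD-1": list(range(60)), "AG 512A": []},
                  notify_host=print)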

Operation 830 arises if buffers in PTC 302-1 are not full and can store the CMD and associated data. The CMD is scheduled in operation 830 per the scheduler 310-1 of FIG. 3, and executed on MM 400 of FIG. 4. The scheduling operation 830 includes resolving any potential conflicts, by arbitrator 312-1, between access CMDs to a same memory bank in a partition by having one of the accesses postponed for another cycle. A wide range of load balancing options and prioritization schemes for scheduling memory accesses per scheduling domain are available to a user, according to how the user configurably programs the scheduler to schedule each of the given scheduling domain queues 308A SD-1 to 308B SD-(2f), including round robin and weighting of queues, as disclosed in the summary section herein. The data accessed by MMCC 200 is not data that needs to be fetched from an off-chip memory into an on-chip cache. Thus, the present disclosure reduces latency by making all memory accesses to on-chip memory.
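
The bank-conflict resolution performed by arbitrator 312-1 can be sketched as a per-cycle grant pass in which at most one access is issued per bank and the loser is postponed to a later cycle; the sketch below is illustrative only.

    # Minimal sketch of operation 830's conflict resolution: when two
    # scheduled accesses target the same bank in the same cycle, the
    # arbitrator postpones one of them.
    def schedule_cycle(pending):
        """pending: list of (cmd_name, bank). Returns (issued, postponed)."""
        issued, postponed, banks_in_use = [], [], set()
        for cmd, bank in pending:
            if bank in banks_in_use:
                postponed.append((cmd, bank))   # retry next cycle
            else:
                banks_in_use.add(bank)
                issued.append((cmd, bank))
        return issued, postponed

    issued, postponed = schedule_cycle([("read A", 3), ("write B", 3),
                                        ("read C", 5)])
    print(issued)      # bank 3 granted once, bank 5 granted
    print(postponed)   # [('write B', 3)] deferred one cycle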

Operation 832, similarly to operation 830, arises if the CMD and data buffers in aggregators 510A and 510B are not full and are capable of storing the CMD and associated data. In operation 832, allocator 520 allocates, on a FIFO basis, a given CSUB CMD retrieved from the command queue 512A to a next available thread of a PE in PE array 500 that is eligible per map table 522, as shown in FIG. 5A. The eligible and/or assigned PEs for a given type of CSUB CMD are specified by the user in operation 804. Details for the operation of the multi-threaded processor once a CSUB CMD has been assigned are described in subsequent flowchart 800-B.
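
A minimal sketch of this FIFO allocation follows, assuming a hypothetical map from CSUB CMD type to eligible PEs and a per-PE pool of open threads.

    # Minimal sketch of operation 832: the allocator pops CSUB CMDs from
    # the command queue in FIFO order and binds each to the next open
    # thread on a PE that the map table marks eligible for that type.
    def allocate(cmd_queue, map_table, open_threads):
        """open_threads: dict pe_id -> set of free thread ids."""
        while cmd_queue:
            cmd = cmd_queue[0]                      # FIFO head
            for pe in map_table[cmd["type"]]:       # eligible PEs only
                if open_threads.get(pe):
                    thread = open_threads[pe].pop()
                    cmd_queue.pop(0)
                    yield cmd["name"], pe, thread
                    break
            else:
                return      # no open thread yet; head waits, FIFO preserved

    q = [{"name": "exact match", "type": "EM"}, {"name": "lpm", "type": "LPM"}]
    table = {"EM": [2, 16], "LPM": [7]}
    threads = {2: {1}, 16: {1, 2}, 7: set()}
    print(list(allocate(q, table, threads)))  # 'exact match' -> PE-2; 'lpm' waits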

After operations 830 and 832, an inquiry per operation 834 determines if the host requires an output result. If a result output is not required, then operation 835 saves the data as required, typically to MM 400 in FIG. 4 or to LDM 540-1 in FIG. 5B.

If a result output is required, then operation 836 reserves a deterministic portion of the DOB 720A or 720B, depending on the port, as shown in FIG. 7. If a deterministic calculation cannot be made, then subsequent CMDs trying to write output result data to the DOB 720A and 720B will be stalled until such time as the deterministic calculation can be made. An example is shown in FIG. 7 in output data buffer 728A for SD-(f), where the ‘exact match’ CMD is listed as ‘UNRESERVED’, thereby blocking any subsequent CMDs in the same SD, i.e., the output for the ‘LPM’ CMD is STALLED, from writing their output results in output data buffer 728A SD-(f). This requirement exists to preserve the FIFO aspect of the MMCC 200. Else, data from a subsequent CMD might consume the entire buffer and void the FIFO protocol for an earlier CMD. If a subsequent CMD is stalled, then the given thread for that CMD on the given processor is stalled, and the round-robin processing of threads will continue to skip that thread and process the remaining threads until the given thread becomes reactivated from the stall, i.e., because the prior CMD has completed its deterministic reservation of the output data buffer.

After the output data buffer is reserved for a deterministic amount of data, operation 838 saves data from the MM 400, via partition controller 302-1, or from the PE array 500, as output results in the DOB 720A or 720B in FIG. 7. Operation 840 inquires if the output data results from the memory access or PE subroutine are completed for the given CMD to transmit out of ROB 700. If not, then operation 841 blocks prematurely completed threads, as required to maintain the FIFO protocol. This is illustrated in FIG. 7 for DOB 720A for SD-(f), where output data for the ‘read’ CMD is still loading ( . . . ), and thus the output data for the subsequent ‘exact match’ CMD, while showing as COMPL., is blocked from being transmitted until its turn in the queue. While blocking a prematurely completed thread from transmitting, operation 838 is repeated to receive and save output data from the MM and the PEs to fill the output data buffer for the CMD currently queued for an immediate transmission.

If inquiry 840 results in a yes, then the output results are completely written into DOB 720A for the current CMD, and operation 842 transmits those output data results to the host in the same sequence as the commands were received from the host, to effectuate a FIFO protocol. This is illustrated in FIGS. 7 and 9.

Referring now to FIG. 8B, a flowchart 800-B is shown of a method for operating a multi-threaded coprocessor (CP) in the MMCC, according to one or more embodiments. Once operation 832 of FIG. 8A allocates the CSUB CMD to a PE thread, then operation 858 of the present flowchart communicates the CSUB CMD and associated operand data to a processing thread portion of CMD REG 552-1. The CSUB CMD has a start line that points to IM 554-1 of FIG. 5B for the appropriate CSUB code, which is then executed by computation engine 560-1.

In operation 860, the PE thread executes the subroutine by fetching, decoding, executing, and writing back per the program instructions, on a fine-grained multithreading cycle of one execution per cycle. Table 900 in FIG. 9 provides examples of the instruction sequences in column 5 for a given command in column 2. Supporting operation 862 are the implicit logic function calls to logic function block 600-1, shown in FIGS. 5 and 6, and the memory access calls to main memory 400 via queue 309 in a partition controller 302-1, as shown in FIG. 3, as well as the return data from those calls.

In operation 864, the PE indexes to a next thread in its queue. Operation 866 inquires if the next PE thread is completed. If the next PE thread has completed, then operation 869 updates allocator 520 of FIG. 5A to indicate an open thread resource to which the allocator can assign a new CSUB CMD. If the next PE thread has not completed, then operation 868 inquires if the next PE thread is idle or stalled. If the next PE thread is idle or has stalled, then operation 864 indexes the PE to the next thread and returns to operation 866 to evaluate that next thread. If the next PE thread is not idle or stalled, then operation 860 executes the instruction for the given thread.
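
Operations 860 through 869 describe a round-robin walk over the PE's thread slots, sketched below with hypothetical 'running', 'stalled', and 'done' thread states.

    # Minimal sketch of operations 860-869: round-robin stepping over a
    # PE's thread slots, executing one instruction per visit, releasing
    # completed threads back to the allocator, and skipping idle/stalled
    # ones so the remaining threads keep the engine busy.
    def step(threads, release_to_allocator):
        """threads: list of dicts with 'state' and 'pc'."""
        for t in threads:                       # operation 864: next thread
            if t["state"] == "done":            # operation 866
                release_to_allocator(t)         # operation 869: open resource
                t["state"] = "free"
            elif t["state"] in ("idle", "stalled", "free"):
                continue                        # operation 868: skip
            else:
                t["pc"] += 1                    # operation 860: execute

    threads = [{"state": "running", "pc": 0},
               {"state": "stalled", "pc": 4},
               {"state": "done",    "pc": 9}]
    step(threads, release_to_allocator=lambda t: print("thread freed"))
    print(threads)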

Referring now to FIG. 8C, a flowchart is shown of a method for reconfiguring a PE with a new instruction set during field operation, according to one or more embodiments. Operation 880 receives a command via serial port 205, using an SPI, SMBus, or I2C protocol, for an in-field reprogramming of one or more target PEs to update or add a new subroutine. Operation 882 deassigns a target PE from map table 522 of allocator 520, shown in FIG. 5A, by overwriting an entry in map table 522 during idle cycles using debug microcontroller (uC) 207, as shown in FIG. 2, and by allowing the subroutines currently running on the target PE to execute to completion. Then operation 884 writes the updated or new subroutine programming into IM 554-1 of the target PE of FIG. 5B, and optionally to NVM 240 of FIG. 2, using debug uC 207 during idle cycles. Operation 886 reassigns the target PE in map table 522 by overwriting the PE back into the table using debug uC 207 during idle cycles.

Case Table

Referring now to FIG. 9, a case table 900 is shown illustrating an I/O scheduling domain and the PE multi-threaded domain for a single port, according to one or more embodiments. The table 900 has 20 rows of data, itemized as rows 1-20, one row for each CMD (col. 2) received at MMCC 200 on a single port, e.g., port IN-A, as shown in FIG. 2. The only input scheduling domains (SD IN) shown, 1-4 (col. 1), are associated with the first port, port A or IN-A, slaved to PFE 102-1, with MMCC 200 and PFE 102-1 exchanging data (noted as “DD”), i.e., memory accesses and/or subroutine calls, per FIG. 1. A similar table with complementary entries would reflect respective traffic for the second port, port B or IN-B, slaved either to the same external processor, PFE 102-1, or to a separate external processor, i.e., optional processor 102-2.

A general overview of the table follows. The entity controlling the data of each table cell is shown under the column headings. Thus, the SD (col. 1) and the CMD (col. 2) are input by the user, PFE 102-1 of FIG. 1. All other table entries are controlled by MMCC 200, thereby relieving the user, PFE 102-1, of administrative overhead and also increasing the bandwidth and efficiency of the user. As shown in the CMDs column (col. 2), memory accesses are interleaved with subroutine calls, as dictated by the needs of the user. Immediately upon receipt of the CMD at the input of the MMCC 200, the CMD is written in the output CMD queue (col. 8) for the same SD (col. 7), as implemented by CQOB 708A of FIG. 7 in SD-1 (not shown). Thus, the FIFO protocol regarding input commands and output results is preserved. A notable exception is when a CMD requires no output, such as a WRITE command with no acknowledge required (no ack), as shown in row 6 and noted by “JJ” in col. 8, which shows ‘skip’ as an illustration of not reserving a slot in the output command queue. The memory scheduler (MEM SCHED) (col. 3) ignores CSUB CMDs (shown as ‘- - -’), while a PE thread (PE-THD, col. 4) executes those CSUB CMDs. The opposite is true for a memory access CMD, with the PE thread ignoring the memory access CMD (shown as ‘- - -’) while the memory controller and scheduler execute it. The PE-THD (col. 4) indicates the processing engine and the thread on that PE that will execute the CMD. The sequence of instructions (col. 5) associated with a CMD is whatever instructions the user chooses, either as a default subroutine provided with the MMCC 200, or as upgraded by NPU software or by a third-party software provider with expertise in network algorithms. The subroutine instructions were loaded in IM 554-1 of FIG. 5B during initialization of the MMCC 200. The instructions are executed on respective cycles (col. 6) of the PE, assuming no stalls exist, per the classic RISC protocol that executes one instruction per cycle, which equates to one execution for a given thread every eight cycles for a PE configured with eight threads. The output scheduling domain (SD OUT) (col. 7) matches that of SD IN (col. 1) for a given CMD.

A specific example in the table follows. Starting with a first example in the first row entry (row 1), the ‘queue full’ status, noted as “HH”, for that given SD of ‘0’ prevents acceptance of a new CMD, so a notice is sent back to the user indicating this status, which will halt new CMDs from being added to this SD until the queue opens again.

In another example, row 1 indicates a user-selected SD of ‘0’ and a CMD of “EXACT MATCH” (with associated ‘key’ data, not shown). In response, the memory scheduler (col. 3) ignores the CMD as a non-memory CMD, and instead the AG 510-A, per FIG. 5A, receives the CMD and classifies it per the CMD Q/TYPE functional block 512A, whereupon AL 520 determines that PE-2 is eligible to process that type of CMD, on available thread 1 of 8 (THD-1/8, noted as ‘FF’). Once relayed to PE-2 (an alias of 550-1 in FIG. 5B) on line 215-1, decoder 558-1 decodes the CMD and points to a sequence of instructions, in IM 554-1 or CMD REG 552-1, starting with ‘hash’ and continuing on to ‘extract, main mem load, extract . . . ’ (col. 5) at cycles 2, 10, 18, 26, and 34 (col. 6), respectively (noted as ‘GG’), one instruction for every eight cycles of the PE 550-1.
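
The quoted cycle numbers follow directly from the eight-way fine-grained multithreading arithmetic, as the short check below shows.

    # Minimal check of the cycle numbers quoted above: with eight-way
    # fine-grained multithreading and one instruction per cycle, a thread
    # that first issues on cycle 2 issues again every 8 cycles.
    first_issue, threads_per_pe = 2, 8
    cycles = [first_issue + i * threads_per_pe for i in range(5)]
    print(cycles)   # -> [2, 10, 18, 26, 34], matching col. 6 of FIG. 9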

Continuing with a next example in row 6, the user sends a CMD for a ‘write (no ack)’, which the memory controller 300 and ultimately the memory scheduler 310-1 of FIG. 3 decode and schedule for an access to a requested partition of memory (not shown). However, because the CMD does not require an acknowledge (ack), no entry is shown in col. 8 for the output CMD queue, noted as “JJ”.

In a next example, row 8 has a CMD of “CMD X”, noted as “KK”, which is linked to the row 18 CMD ‘exact match’, noted as “LL”. While the default for CSUB CMDs and memory access CMDs is to leave the CMDs unlinked, flowing freely through the redundant resources, it is possible to link CMDs as shown.

For the row 9 example of an ‘exact match’, the thread assigned is THD-2/2, shown by “MM”, for PE-16. In this embodiment, PE-16 has only two threads assigned, while other PEs have eight or four threads assigned. The determination of how many threads each PE has is a configurable parameter set by the user, depending on their need for processing power for any given type of CMD for which the PEs are slated. Thus, a two-threaded PE will return the results of its two CSUB CMDs faster than if they were being executed on an eight-thread PE, where they would be interleaved with six other CSUB CMDs.

Flow Path Illustration

Referring now to FIG. 10, a flow-path illustration 1000 is shown of multiple commands concurrently executing on the MMCC 200, both to access MM 400 and to call subroutines in the PE array 500, according to one or more embodiments. Four commands, CMD-1, CMD-2, CMD-3, and CMD-4, from a single scheduling domain are illustrated as having concurrent and independent pipeline execution on MMCC 200. Letters are sequentially arranged along each flow path, and from path to path, though the paths can be executed in parallel.

Beginning with the example of CMD-1, a read operation enters input interface 204 at point A and is reserved at ROB 700 simultaneously. It is then scheduled in memory controller 300 to access a read address in MM 400 at point B, and then the output data is queued in ROB 700 and finally output as OP-1 from interface 224 at point C. Input interface 204 refers to SerDes 204-B, while output interface 224 refers to SerDes 224-B, as shown in FIG. 2.

Next, for CMD-2, a CSUB CMD is received at interface 204 at point D and is executed by custom logic 600 in PE array 500 at point E, which has a memory call instruction that is submitted through memory controller 300 and into MM 400 at point F, whose read data is loaded directly back into PE array 500 at point G. At point G, the next CSUB code instruction executed in custom logic 600 is a particular data processing operation with another memory call instruction that is again submitted through memory controller 300 and into MM 400 at point H, whose read data is similarly loaded directly back into PE array 500 at point I. This loop continues for two more iterations through points J and K, and finally finishes with a write into memory at point L. No data is output back to the host, so there is no flow path into the ROB 700 or output interface 224. CMD-2 illustrates the benefit of executing CMDs on the MMCC 200 with a main memory and coprocessor integrated on one chip, which is that all the iterations back and forth to fetch data from MM 400 and the processing of data in PE array 500 do not pass through the chip interfaces 204, 224, which thereby substantially reduces latency for the CSUB code execution. The alternative would be to execute the instructions off-chip and then have to pass through an input and output interface of the PFE 102-1 and the DRAM 113 for every data fetch from memory. Because every memory access using the integrated PE array 500 of the present disclosure saves up to three orders of magnitude in cycle time, the benefits multiply with iterative calls to memory. Additionally, the processing and overhead savings to the off-chip processor PFE 102-1 from using the present disclosure are equally beneficial. The alternative would require PFE 102-1 to: schedule a memory fetch to DRAM 113 of FIG. 1; execute the CSUB code on the data at PFE 102-1; handle exceptions and interrupts associated with the data fetches and processing; and finally write the final data back into DRAM 113.

The internal access call of MMCC 200 has substantially less latency, e.g., up to three orders of magnitude fewer cycles, compared to an external access call. This is because a direct internal memory access, called by the present method for read, write, and RMW alike, bypasses the chip interface (I/F) and its associated operations and latency. In comparison, a similar conventional operation would require a discrete processor/coprocessor (PFE) chip to schedule a data fetch from a discrete and separate main memory chip. This model would incur the latency and power consumption of encoding/decoding, packetizing/unpacking, serializing/deserializing, and transmitting/receiving data across the interfaces of both chips. Doing this repeatedly for each of the loops in the subroutine compounds the problem. If the processor/coprocessor is also responsible for managing the conflict avoidance in scheduling a data fetch, then this consumes precious processor bandwidth while tasks that are more sophisticated wait. Additionally, driving large quantities of data over external lines, which are longer than an internal chip path, consumes power and subjects the data to noise.
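
The compounding effect of the loop can be made concrete with a back-of-envelope sketch; the per-access cycle counts below are illustrative assumptions consistent with the 'three orders of magnitude' figure above, not measured values.

    # Hypothetical back-of-envelope comparison (the per-access figures are
    # illustrative assumptions, not measured values): each off-chip fetch
    # pays the encode/packetize/SerDes/transmit cost on both chips, and a
    # looping subroutine pays it once per iteration.
    ON_CHIP_CYCLES = 10          # assumed direct internal access
    OFF_CHIP_CYCLES = 10_000     # assumed, ~three orders of magnitude more

    iterations = 4               # e.g., CMD-2's repeated memory calls
    on_chip = iterations * ON_CHIP_CYCLES
    off_chip = iterations * OFF_CHIP_CYCLES
    print(f"on-chip: {on_chip} cycles, off-chip: {off_chip} cycles, "
          f"saving: {off_chip // on_chip}x")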

CMD-3 is similar in that it is received at input interface 204 at point M, makes a CSUB code call at point N, and executes an instruction that requires a data fetch, which is routed through memory controller 300 and into MM 400 at point O and which subsequently loads the data directly into PE array 500 at point P, thereby completing the CSUB code execution. Output results are communicated to ROB 700 and subsequently transmitted as OP-3 from output interface 224 at point Q. The OP-3 from CMD-3 is sequenced after OP-1 from CMD-1 to maintain the FIFO protocol, since they are in the same scheduling domain.

Finally, CMD-4 is a CSUB CMD received at input interface 204 at point R and then executed in custom logic 600 in PE array 500 at point S, whose output result is then scheduled through memory controller 300 to be written into MM 400, with no output being returned to the user.

Overall, substantial benefits arise from concurrently, independently, and interactively processing the memory access commands and the CSUB commands in the integrated main memory and coprocessor, respectively. Because both the main memory and the coprocessor have multiple parallel redundant resources that autonomously execute their respective commands in a load-balanced manner, and because the transmission of the output results is slaved to the sequence of the received input CMDs, the entire internal process is transparent to the user.

Chip Layout

Referring now to FIG. 11, a layout diagram 1100 is shown illustrating the placement and size of the MM compared to the CP, according to one or more embodiments. Eight memory partition blocks 0-7 occupy the outer portions of the die and consume approximately 60% of the chip area. However, any area size or partitioning scheme of memory can benefit from the advantages of the present disclosure, including memory areas above 30%, 40%, and 50% of the chip area. The portion of the memory partition block that is actually memory cells can range from 20% and up, 30% and up, or 40% and up, with the remainder of the block slated for the logic, routing lines, sense amplifiers, decoders, memory access controller, redundant memory, buffers, etc. that support the actual reading and writing, as shown in FIG. 4.

Processing engines are disposed in two rows, 1-4 on top and 5-8 on the bottom of the illustration, separated by interconnects and allocator logic. Every PE is coupled directly to the single allocator and directly to each aggregator, and the PEs are daisy-chained to each other. This high degree of direct interconnection minimizes muxing and control overhead, thereby minimizing latency. Partition controllers PTC 1, 2, 3, and 4 control memory partitions 1-2, 3-4, 5-6, and 7-8, respectively.

SerDes A and B blocks, having both transmit (Tx) and receive (Rx) drivers, are disposed on the horizontal centerline of the die, with the actual Rx and Tx ports, i.e., bumps of the flip chip, disposed at the top edge and the bottom edge, respectively. This layout of the input and output interface provides advantages in routing that reduce the noise and cross-coupling effects of interleaving the Rx and Tx lines. More detail on the center SerDes and segregated Rx and Tx lines is disclosed in: i) U.S. Pat. No. 8,890,332, issued Nov. 18, 2014, entitled “SEMICONDUCTOR CHIP LAYOUT WITH STAGGERED TX AND TX DATA LINES”; and ii) U.S. Pat. No. 8,901,747, issued Dec. 2, 2014, entitled “SEMICONDUCTOR CHIP LAYOUT”, both of which are commonly assigned with the present application and both of which are hereby incorporated by reference in their entirety.

Alternative Embodiments

The methods, operations, processes, functions, systems, and apparatuses disclosed herein may be implemented in any means for achieving the various aspects, and may be executed by an integrated circuit such as a memory device or via a machine-readable medium, e.g., a computer-readable medium, embodying a set of instructions that, when executed by a machine such as a processor in a computer, server, etc., cause the machine to perform any of the operations or functions disclosed herein. Functions or operations may include receiving, initializing, reserving, communicating, buffering, scheduling, aggregating, allocating, blocking, transmitting, executing, fetching, decoding, writing back, overwriting, deassigning, updating, tagging, storing, identifying, and the like. The memory device or similar electronic computing device manipulates and transforms data represented as physical (electronic) quantities within the device's registers and memories into other data similarly represented as physical quantities within the device's memories or registers or other such information storage, transmission, or display devices.

The term “machine-readable medium” includes any medium that is capable of storing, encoding, and/or carrying a set of instructions for execution by the computer or machine and that causes the computer or machine to perform any one or more of the methodologies of the various embodiments. The “machine-readable medium” shall accordingly be taken to include, but not be limited to, non-transitory tangible media, such as solid-state memories, optical and magnetic media, compact discs, and any other storage device that can retain or store the instructions and information. The present disclosure is also capable of implementing the methods and processes described herein using transitory signals as well, e.g., electrical, optical, and other signals in any format and protocol that convey the instructions, algorithms, etc. to implement the present processes and methods.

The present disclosure is applicable to any type of network, including the Internet, an intranet, and other networks such as a local area network (LAN), home area network (HAN), virtual private network (VPN), campus area network (CAN), metropolitan area network (MAN), wide area network (WAN), backbone network (BN), global area network (GAN), an interplanetary Internet, etc.

Methods and operations described herein can be performed in sequences different from the exemplary ones described herein, e.g., in a different order. Thus, one or more additional new operations may be inserted within the existing operations, or one or more operations may be abbreviated or eliminated, according to a given application, so long as substantially the same function, way, and result are obtained.

The specific quantity of components in a series or span of redundant components described in the present disclosure is only by way of example, and not by way of limitation. Other embodiments include a greater or lesser number of components and one or many series within a cycle.

As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean “including, but not limited to.”

Various units, circuits, or other components may be described as “configured to” perform a task or tasks. In such contexts, “configured to” is a broad recitation of structure generally meaning “having circuitry that” performs the task or tasks during operation. As such, the unit/circuit/component can be configured to perform the task even when the unit/circuit/component is not currently on. In general, the circuitry that forms the structure corresponding to “configured to” may include hardware circuits. Similarly, various units/circuits/components may be described as performing a task or tasks, for convenience in the description. Such descriptions should be interpreted as including the phrase “configured to.” Reciting a unit/circuit/component that is configured to perform one or more tasks is expressly intended not to invoke 35 U.S.C. §112, paragraph six, interpretation for that unit/circuit/component.

The foregoing descriptions of specific embodiments of the present disclosure have been presented for purposes of illustration and description. They are not intended to be exhaustive or to limit or restrict the invention to the precise forms disclosed, even where only a single embodiment is described with respect to a particular feature. It is the intention to cover all modifications, alternatives, and variations possible in light of the above teaching without departing from the broader spirit and scope of the various embodiments. The embodiments were chosen and described to explain the principles of the invention and its practical application in the best manner, and to enable others skilled in the art to best utilize the invention and the various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the Claims appended hereto and their equivalents.

We claim:
 1. An integrated circuit (IC) comprising: an input interface for receiving an external command and optional external data; a main memory (MM) coupled to the input interface, the MM comprising: a plurality of memory cells configured to store data; a memory controller (MC) configured to execute an access command to one or more of the plurality of memory cells; and a coprocessor (CP) coupled to the input interface and the MM, the coprocessor comprising: a processing engine (PE) coupled to the MM, wherein the processing engine is configured to execute a command for a subroutine call on data without requiring an interrupt; an output interface for transmitting data, the output interface coupled to the MM and the PE; and an aggregator (“AG”) coupled to the input interface, the aggregator comprising: an aggregator buffer memory partitioned into one or more categories of subroutine call (CSUB) command (CMD) types; and wherein: the AG is configured to: receive a plurality of CSUB CMDs for subroutine calls; and aggregate the CSUB CMDs in a partition of the aggregator buffer memory according to its CSUB CMD type.
 2. The IC of claim 1 further comprising: an allocator (“AL”) coupled to the AG; and wherein: the AL is configured to: identify a type of CSUB CMD for the CSUB CMD received; identify an eligible PE that maps to the type of CSUB CMD of the CSUB CMD received and that has an open processing thread; and assign the CSUB CMD to the eligible PE; and route the CSUB CMDs to eligible PEs on a FIFO basis.
 3. The IC of claim 2 further comprising: a plurality of PEs; and wherein: the AL comprises a mapping table that lists an operational status of all the PEs and that lists a type of CSUB CMD for which each of the PEs is configured to execute.
 4. The IC of claim 3 wherein: a portion of the plurality of PEs is configured with a first quantity of processing threads to execute CSUB code; a different portion of the plurality of PEs is configured with a second quantity of processing threads to execute CSUB code; and the first quantity is different from the second quantity.
 5. The IC of claim 2 wherein: the allocator is not required to assign CSUB CMDs in a sequence that is the same as a sequence of when the aggregator received CSUB CMDs; and the PE is not required to execute CSUB CMDs in a sequence that is the same as the sequence of when the aggregator receives CSUB CMDs.
 6. The IC of claim 1 further comprising: a plurality of ports coupled to the input interface; and wherein: commands are received on each of the plurality of ports; the commands are aggregated by the aggregator in a partition of the aggregator buffer memory according to a respective CSUB CMD type.
 7. The IC of claim 6 further comprising: a plurality of aggregators, coupled to at least one respective input port of the plurality of input ports; and wherein: the plurality of aggregators are coupled to a single allocator; and the single allocator assigns each of the plurality of CSUB CMDs to an eligible PE.
 8. The IC of claim 1 wherein: a category of the CSUB CMD type is configurably established.
 9. The IC of claim 1 wherein: a category of CSUB CMD types includes at least one of: i) CMDs for a longest prefix match CSUB, ii) CMDs for EXACT MATCH CSUB and SEARCH CSUB, iii) CMDs for a high-priority LPM having a unique CSUB opcode, and iv) CMDs for other CSUBs.
 10. The IC of claim 1 wherein: each of the PEs implements a RISC protocol that executes one instruction per cycle.
 11. The IC of claim 1 wherein: the IC caches neither data nor instructions.
 12. The IC of claim 1 wherein: the processing sequence of CSUB code, or steps therein, can be of any sequence that utilizes the PEs most optimally, e.g., highest duty cycle, fastest throughput, best hierarchical priority processing.
 13. A network system comprising: a packet forwarding engine (PFE); and an integrated circuit (IC) coupled to the PFE, the IC comprising: an input interface for receiving an external command and optional external data; a main memory (MM) coupled to the input interface, the MM comprising: a plurality of memory cells configured to store data; a memory controller (MC) configured to execute an access command to one or more of the plurality of memory cells; and a coprocessor (CP) coupled to the input interface and the MM, the coprocessor comprising: a processing engine (PE) coupled to the MM, wherein the processing engine is configured to execute a command for a subroutine call on data without requiring an interrupt; an output interface for transmitting data, the output interface coupled to the MM and the PE; and an aggregator (“AG”) coupled to the input interface, the aggregator comprising: an aggregator buffer memory partitioned into one or more categories of subroutine call (CSUB) command (CMD) types; and wherein: the AG is configured to: receive a plurality of CSUB CMDs for subroutine calls; and aggregate the CSUB CMDs in a partition of the aggregator buffer memory according to its CSUB CMD type.
 14. The network system of claim 13 further comprising: an allocator (“AL”) coupled to the AG; and wherein: the AL is configured to: identify a type of CSUB CMD for the CSUB CMD received; identify an eligible PE that maps to the type of CSUB CMD of the CSUB CMD received and that has an open processing thread; and assign the CSUB CMD to the eligible PE; and route the CSUB CMDs to eligible PEs on a FIFO basis.
 15. The network system of claim 14 further comprising: a plurality of PEs; and wherein: the AL comprises a mapping table that lists an operational status of all the PEs and that lists a type of CSUB CMD for which each of the PEs is configured to execute.
 16. The network system of claim 13 further comprising: one or more DRAM chips coupled to the PFE in parallel with the IC.
 17. A method of processing data in an IC chip comprising: receiving a command (CMD) at an input interface; communicating the CMD to a main memory (MM); and communicating the CMD to a coprocessor (CP) comprising at least one processing engine (PE) that is coupled to the MM; and wherein: the CMD is communicated in parallel to the MM and the PE; receiving a plurality of CSUB CMDs at an aggregator from the input interface; and aggregating the CSUB CMDs in a partition of an aggregator buffer memory according to a type of the CSUB CMD.
 18. The method of claim 17 further comprising: allocating via an allocator a CSUB CMD to a given PE in the CP assigned to the type of CSUB CMD and according to an availability of a processing thread in the given PE; executing instructions for the CSUB CMD on one of a plurality of threads on one of a plurality of PEs of the CP on a FIFO basis for the type of CSUB CMD.
 19. The method of claim 17 further comprising: creating a mapping table that lists an operational status of a plurality of PEs and that lists a type of CSUB CMD for which each of the PEs is configured to execute.
 20. The method of claim 19 wherein: configuring a portion of the plurality of PEs with a first quantity of processing threads to execute CSUB code; configuring a different portion of the plurality of PEs with a second quantity of processing threads to execute CSUB code; and the first quantity is different from the second quantity.